Poorly-balanced socket accepts with Linux 3.2 kernel vs 2.6 kernel

Posted by 六眼飞鱼酱① on 2019-11-30 06:24:39
Brett

Don't depend on the OS to balance load across web server processes that all call accept on the same listening socket.

The Linux kernel's behavior here differs from version to version. We saw particularly imbalanced behavior with the 3.2 kernel, and it appeared to be somewhat more balanced in later versions, e.g. 3.6.

We were operating under the assumption that there should be a way to make Linux do something like round-robin across the accepting processes, but we ran into a variety of issues, including:

  • Linux kernel 2.6 showed something like round-robin behavior on bare metal (imbalances were about 3-to-1), Linux kernel 3.2 did not (10-to-1 imbalances), and kernel 3.6.10 seemed okay again. We did not attempt to bisect to the actual change.
  • Regardless of the kernel version or build options used, the behavior we saw on a 32-logical-core HVM instance on Amazon Web Services was severely weighted toward a single process. There may be issues with socket accept under Xen: https://serverfault.com/questions/272483/why-is-tcp-accept-performance-so-bad-under-xen
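
To make figures like "3-to-1" and "10-to-1" concrete, here is a minimal sketch of one way to compute such a ratio from per-worker accept counts. It assumes the ratio means busiest worker divided by least busy worker, which the tests linked from this answer may or may not have used; the function name and the sample counts are made up for illustration.

```javascript
// Hypothetical helper: ratio between the busiest and the least busy worker.
// Assumes each array entry is the number of connections one worker accepted.
function imbalanceRatio(acceptCounts) {
  if (acceptCounts.length === 0) {
    throw new Error('need at least one worker');
  }
  const max = Math.max(...acceptCounts);
  const min = Math.min(...acceptCounts);
  // A worker that accepted nothing makes the ratio unbounded.
  return min === 0 ? Infinity : max / min;
}

// Made-up counts for four workers, not measurements from our servers.
const sampleCounts = [3000, 1000, 2000, 1500];
console.log(imbalanceRatio(sampleCounts)); // 3
```

A ratio near 1 would mean the kernel is spreading accepts evenly; the further above 1 it gets, the more a single process is absorbing the load.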

You can see our tests in great detail on the GitHub issue we were using to correspond with the excellent Node.js team, starting about here: https://github.com/joyent/node/issues/3241#issuecomment-11145233

That conversation ends with the Node.js team indicating that they are seriously considering implementing explicit round-robin in Cluster, and starting an issue for that: https://github.com/joyent/node/issues/4435. It also ends with the Trello team (that's us) going to our fallback plan: a local HAProxy process that proxies across 16 ports on each server machine, with a 2-worker-process Cluster instance running on each port (for fast failover at the accept level in case of a process crash or hang). That plan is working beautifully, with greatly reduced variation in request latency and a lower average latency as well.
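
As a rough illustration of why an explicit proxy in front of 16 ports evens things out, here is a small sketch of round-robin selection over a list of ports. It is not our HAProxy configuration: the base port 8000 and the function names are hypothetical, and it assumes plain round-robin, whereas the balancing algorithm HAProxy was actually configured with is not stated above.

```javascript
// Hypothetical values: the fallback plan used 16 ports, but the actual
// port numbers are not given, so 8000 is only a placeholder.
const BASE_PORT = 8000;
const PORT_COUNT = 16;

// Build the list of backend ports: basePort, basePort + 1, ...
function makePortList(basePort, count) {
  return Array.from({ length: count }, (_, i) => basePort + i);
}

// Return a function that hands out items in strict rotation,
// so no backend is picked twice before every other one is picked once.
function makeRoundRobin(items) {
  if (items.length === 0) {
    throw new Error('need at least one backend');
  }
  let next = 0;
  return function pick() {
    const item = items[next];
    next = (next + 1) % items.length;
    return item;
  };
}

const pickPort = makeRoundRobin(makePortList(BASE_PORT, PORT_COUNT));
const picks = Array.from({ length: PORT_COUNT + 1 }, () => pickPort());
console.log(picks[0], picks[15], picks[16]); // 8000 8015 8000
```

Because the choice is made explicitly by the dispatcher rather than by whichever process wins the race to accept, each port receives the same share of connections regardless of kernel version.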

There is a lot more to be said here, and I did NOT take the step of mailing the Linux kernel mailing list, as it was unclear whether this was really a Xen issue, a Linux kernel issue, or just an incorrect expectation of multiple accept behavior on our part.

I'd love to see an answer from an expert on multiple accept, but we're going back to what we can build using components that we understand better. If anyone posts a better answer, I would be delighted to accept it instead of mine.
