[Original] How TCP backlog works in Linux

Posted by 半腔热情 on 2019-12-02 04:45:21

How TCP backlog works in Linux

January 1, 2014 (updated March 14, 2015)

When an application puts a socket into LISTEN state using the listen syscall, it needs to specify a backlog for that socket. The backlog is usually described as the limit for the queue of incoming connections.

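When an application puts a socket into the LISTEN state through the listen system call, it must specify a backlog argument for that socket.
The backlog argument is usually described as the length of the queue that holds incoming connections.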
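As a concrete reference point, here is a minimal sketch of the call being discussed, written in Python, whose `socket.listen(backlog)` forwards the value to the listen syscall. The helper name `open_listener`, the loopback address, and the choice of port 0 are illustrative assumptions, not part of the original article.

```python
import socket


def open_listener(backlog, host="127.0.0.1", port=0):
    """Create a TCP socket in LISTEN state with the given backlog.

    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))   # port 0 lets the OS pick a free port
        sock.listen(backlog)      # the backlog argument this article is about
    except OSError:
        sock.close()
        raise
    return sock


if __name__ == "__main__":
    listener = open_listener(backlog=5)
    print("listening on", listener.getsockname())
    listener.close()
```

What the value 5 actually limits is the subject of the rest of the article.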



Because of the 3-way handshake used by TCP, an incoming connection goes through an intermediate state SYN RECEIVED before it reaches the ESTABLISHED state and can be returned by the accept syscall to the application (see the part of the TCP state diagram reproduced above). This means that a TCP/IP stack has two options to implement the backlog queue for a socket in LISTEN state:

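Because TCP uses a three-way handshake, an incoming connection first passes through the intermediate SYN RECEIVED state and only then enters the ESTABLISHED state, at which point the application can obtain it through the accept system call.
This means a TCP/IP stack has two ways to implement the backlog queue for a socket in the LISTEN state: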

The implementation uses a single queue, the size of which is determined by the backlog argument of the listen syscall. When a SYN packet is received, it sends back a SYN/ACK packet and adds the connection to the queue. When the corresponding ACK is received, the connection changes its state to ESTABLISHED and becomes eligible for handover to the application. This means that the queue can contain connections in two different states: SYN RECEIVED and ESTABLISHED. Only connections in the latter state can be returned to the application by the accept syscall.

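Option 1

  • A single queue is used; its length is set by the backlog argument of the listen system call.
  • When a SYN packet arrives, a SYN/ACK packet is sent back and the connection is added to the queue.
  • When the final ACK arrives, the connection moves to the ESTABLISHED state; only then can the application obtain it.
  • The queue therefore holds connections in two states: SYN RECEIVED and ESTABLISHED.
  • Only connections in the ESTABLISHED state can be returned to the application by the accept system call.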
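The single-queue option can be pictured with a small toy model. This is only a sketch of the bookkeeping described above, not code from any real TCP stack; the class and method names are hypothetical, and refusing a SYN when the queue is full is an assumption made so the model has a defined behavior at the limit.

```python
class SingleQueueListener:
    """Toy model: one queue holds SYN RECEIVED and ESTABLISHED connections."""

    def __init__(self, backlog):
        self.backlog = backlog
        self.queue = {}  # connection id -> state, in arrival order

    def on_syn(self, conn):
        """Return True if a SYN/ACK would be sent, False if the SYN is ignored."""
        if len(self.queue) >= self.backlog:
            return False  # assumption: a full queue means no SYN/ACK
        self.queue[conn] = "SYN_RECEIVED"
        return True

    def on_ack(self, conn):
        """The final ACK of the handshake completes the connection."""
        if self.queue.get(conn) == "SYN_RECEIVED":
            self.queue[conn] = "ESTABLISHED"

    def accept(self):
        """Hand over the oldest ESTABLISHED connection, or None if there is none."""
        for conn, state in list(self.queue.items()):
            if state == "ESTABLISHED":
                del self.queue[conn]
                return conn
        return None
```

Note how a half-finished handshake and a finished one compete for the same slots: with a backlog of 2, two pending SYNs are enough to block a third client even though nothing is ready to be accepted.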

The implementation uses two queues, a SYN queue (or incomplete connection queue) and an accept queue (or complete connection queue). Connections in state SYN RECEIVED are added to the SYN queue and later moved to the accept queue when their state changes to ESTABLISHED, i.e. when the ACK packet in the 3-way handshake is received. As the name implies, the accept call is then implemented simply to consume connections from the accept queue. In this case, the backlog argument of the listen syscall determines the size of the accept queue.

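Option 2

  • Two queues are used: a SYN queue (also called the incomplete connection queue) and an accept queue (also called the complete connection queue).
  • Connections in the SYN RECEIVED state are placed in the SYN queue and moved to the accept queue once their state changes to ESTABLISHED.
  • That move happens when the final ACK of the three-way handshake is received.
  • In this implementation the accept system call simply takes connections from the accept queue, and the backlog argument determines the size of the accept queue.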
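The two-queue option can be sketched in the same toy style. Again this is an illustration under stated assumptions rather than real stack code: the names are hypothetical, `syn_queue_limit` is a made-up parameter standing in for whatever bounds the SYN queue, and leaving a connection in the SYN queue when the accept queue is full is just one possible policy chosen for the model.

```python
from collections import deque


class TwoQueueListener:
    """Toy model: a SYN queue for half-open and an accept queue for complete connections."""

    def __init__(self, backlog, syn_queue_limit):
        self.backlog = backlog                  # bounds the accept queue
        self.syn_queue_limit = syn_queue_limit  # hypothetical bound on the SYN queue
        self.syn_queue = set()
        self.accept_queue = deque()

    def on_syn(self, conn):
        """Return True if the connection was added to the SYN queue."""
        if len(self.syn_queue) >= self.syn_queue_limit:
            return False
        self.syn_queue.add(conn)
        return True

    def on_ack(self, conn):
        """Try to move a connection from the SYN queue to the accept queue."""
        if conn not in self.syn_queue:
            return False
        if len(self.accept_queue) >= self.backlog:
            return False  # assumption: the connection stays in the SYN queue
        self.syn_queue.remove(conn)
        self.accept_queue.append(conn)
        return True

    def accept(self):
        """Consume the oldest complete connection, or None if the queue is empty."""
        return self.accept_queue.popleft() if self.accept_queue else None
```

Here the backlog only counts finished handshakes, so slow clients fill the SYN queue without taking slots away from connections that are ready to be accepted.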

Historically, BSD-derived TCP implementations use the first approach. That choice implies that when the maximum backlog is reached, the system will no longer send back SYN/ACK packets in response to SYN packets. Usually the TCP implementation will simply drop the SYN packet (instead of responding with a RST packet) so that the client will retry. This is what is described in section 14.5, listen Backlog Queue in W. Richard Stevens’ classic textbook TCP/IP Illustrated, Volume 3.

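Historically, BSD-derived TCP implementations have all used the first option.

  • This choice implies that once the backlog limit is reached, the system no longer answers incoming SYN packets with SYN/ACK packets.
  • Usually the TCP implementation simply drops the SYN packet (rather than replying with a RST packet), so the client retransmits the SYN.
  • This is the behavior W. Richard Stevens describes in section 14.5 of TCP/IP Illustrated, Volume 3.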

Note that Stevens actually explains that the BSD implementation does use two separate queues, but they behave as a single queue with a fixed maximum size determined by (but not necessarily exactly equal to) the backlog argument, i.e. BSD logically behaves as described in option 1:

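Note that when Stevens explains the BSD implementation he does mention two queues, but they behave like a single queue whose fixed size is determined by (though not necessarily equal to) the backlog argument; put differently, the limit derived from backlog applies to the combined length of the two queues.
In other words, the BSD implementation behaves logically as if it used the first option.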

The queue limit applies to the sum of […] the number of entries on the incomplete connection queue […] and […] the number of entries on the completed connection queue […].

The queue limit applies to the number of entries on the incomplete connection queue plus the number of entries on the completed connection queue.

On Linux, things are different, as mentioned in the man page of the listen syscall:

On Linux the situation differs from the conclusion above, as the man page of the listen system call describes:

The behavior of the backlog argument on TCP sockets changed with Linux 2.2. Now it specifies the queue length for completely established sockets waiting to be accepted, instead of the number of incomplete connection requests. The maximum length of the queue for incomplete sockets can be set using /proc/sys/net/ipv4/tcp_max_syn_backlog.
This means that current Linux versions use the second option with two distinct queues: a SYN queue with a size specified by a system wide setting and an accept queue with a size specified by the application.

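Starting with Linux 2.2, the behavior of the backlog argument on TCP sockets changed:

  • The backlog argument now controls the length of the queue that holds completely established sockets.
  • The length of the queue that holds incomplete sockets is controlled through /proc/sys/net/ipv4/tcp_max_syn_backlog.
  • This means current Linux versions use the second option, with two queues.
  • The size of the SYN queue is set by a system-wide parameter.
  • The size of the accept queue is set by the backlog value the application specifies.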
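To see the system-wide setting on a given machine, the /proc file named above can simply be read. The sketch below assumes the file holds a single integer; the helper name `read_int_setting` is hypothetical, and it returns None when the file is missing or unreadable (for example on a non-Linux system).

```python
from pathlib import Path


def read_int_setting(path):
    """Return the integer stored in a /proc-style file, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("tcp_max_syn_backlog:",
          read_int_setting("/proc/sys/net/ipv4/tcp_max_syn_backlog"))
```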

The interesting question is now how such an implementation behaves if the accept queue is full and a connection needs to be moved from the SYN queue to the accept queue, i.e. when the ACK packet of the 3-way handshake is received. This case is handled by the tcp_check_req function in net/ipv4/tcp_minisocks.c. The relevant code reads:

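An interesting question: in the two-queue implementation, what happens when the accept queue is already full but a connection needs to move from the SYN queue to the accept queue (that is, the final ACK of the three-way handshake has arrived)?
This case is handled in the tcp_check_req function in net/ipv4/tcp_minisocks.c; the relevant code is: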

        child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
        if (child == NULL)
                goto listen_overflow;

For IPv4, the first line of code will actually call tcp_v4_syn_recv_sock in net/ipv4/tcp_ipv4.c, which contains the following code:

For IPv4, the first line of the code above actually calls the tcp_v4_syn_recv_sock function in net/ipv4/tcp_ipv4.c, which contains the following fragment:

if (sk_acceptq_is_full(sk))
        goto exit_overflow;
...
exit_overflow:
    NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
exit:
    NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
    dst_release(dst);
    return NULL;
...

We see here the check for the accept queue. The code after the exit_overflow label will perform some cleanup, update the ListenOverflows and ListenDrops statistics in /proc/net/netstat and then return NULL. This will trigger the execution of the listen_overflow code in tcp_check_req:

Here the code checks whether the accept queue is full.
The code after the exit_overflow label first performs some cleanup, then updates the ListenOverflows and ListenDrops statistics in /proc/net/netstat, and finally returns NULL.
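The two counters can be watched from user space. /proc/net/netstat is laid out as pairs of lines, a header row of counter names followed by a row of values, each starting with a prefix such as `TcpExt:`. The sketch below parses that layout; the function names are hypothetical.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat content into {prefix: {counter: value}}."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    stats = {}
    # Lines come in pairs: a row of names, then a row of values.
    for names, values in zip(lines[0::2], lines[1::2]):
        prefix = names[0].rstrip(":")
        stats[prefix] = dict(zip(names[1:], map(int, values[1:])))
    return stats


def listen_counters(text):
    """Return (ListenOverflows, ListenDrops), with None for a missing counter."""
    tcp_ext = parse_netstat(text).get("TcpExt", {})
    return tcp_ext.get("ListenOverflows"), tcp_ext.get("ListenDrops")


if __name__ == "__main__":
    try:
        with open("/proc/net/netstat") as f:
            print(listen_counters(f.read()))
    except OSError:
        print("/proc/net/netstat not available")
```

Sampling these values twice and comparing them is a simple way to tell whether a listening socket is overflowing its accept queue.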

This triggers the listen_overflow code in tcp_check_req:

listen_overflow:
        if (!sysctl_tcp_abort_on_overflow) {
                inet_rsk(req)->acked = 1;
                return NULL;
        }

This means that unless /proc/sys/net/ipv4/tcp_abort_on_overflow is set to 1 (in which case the code right after the code shown above will send a RST packet), the implementation basically does… nothing!

As the code shows, unless /proc/sys/net/ipv4/tcp_abort_on_overflow is set to 1 (in which case a RST packet is sent), the implementation does essentially nothing.

To summarize, if the TCP implementation in Linux receives the ACK packet of the 3-way handshake and the accept queue is full, it will basically ignore that packet. At first, this sounds strange, but remember that there is a timer associated with the SYN RECEIVED state: if the ACK packet is not received (or if it is ignored, as in the case considered here), then the TCP implementation will resend the SYN/ACK packet (with a certain number of retries specified by /proc/sys/net/ipv4/tcp_synack_retries and using an exponential backoff algorithm).

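To summarize: if the Linux TCP implementation receives the final ACK of the three-way handshake while the accept queue is full, it essentially ignores that packet.

This strategy may look odd, but remember that a timer is associated with the SYN RECEIVED state: if the ACK never arrives (or arrives and is ignored), the TCP implementation resends the SYN/ACK packet (the number of retries is set by /proc/sys/net/ipv4/tcp_synack_retries, and an exponential backoff algorithm is used).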

This can be seen in the following packet trace for a client attempting to connect (and send data) to a socket that has reached its maximum backlog:

This behavior is visible in the packet trace below, where a client tries to connect (and send data) to a socket that has already reached its backlog limit:

  0.000  127.0.0.1 -> 127.0.0.1  TCP 74 53302 > 9999 [SYN] Seq=0 Len=0
  0.000  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  0.000  127.0.0.1 -> 127.0.0.1  TCP 66 53302 > 9999 [ACK] Seq=1 Ack=1 Len=0
  0.000  127.0.0.1 -> 127.0.0.1  TCP 71 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  0.207  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  0.623  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  1.199  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  1.199  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 6#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
  1.455  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  3.123  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  3.399  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  3.399  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 10#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
  6.459  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
  7.599  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
  7.599  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 13#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
 13.131  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
 15.599  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
 15.599  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 16#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
 26.491  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
 31.599  127.0.0.1 -> 127.0.0.1  TCP 74 9999 > 53302 [SYN, ACK] Seq=0 Ack=1 Len=0
 31.599  127.0.0.1 -> 127.0.0.1  TCP 66 [TCP Dup ACK 19#1] 53302 > 9999 [ACK] Seq=6 Ack=1 Len=0
 53.179  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
106.491  127.0.0.1 -> 127.0.0.1  TCP 71 [TCP Retransmission] 53302 > 9999 [PSH, ACK] Seq=1 Ack=1 Len=5
106.491  127.0.0.1 -> 127.0.0.1  TCP 54 9999 > 53302 [RST] Seq=1 Len=0
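The retransmitted SYN/ACK packets in the trace arrive at roughly 1.2, 3.4, 7.6, 15.6 and 31.6 seconds, which is the pattern a timeout that doubles after every retry would produce. A minimal sketch of that schedule follows; the helper name is hypothetical, and the initial timeout of 1 second is an assumption chosen to roughly match the trace, not a value taken from the kernel source.

```python
def synack_retransmit_times(retries, initial_timeout=1.0):
    """Cumulative times (seconds) of SYN/ACK retransmissions with a doubling timeout."""
    times = []
    elapsed = 0.0
    timeout = initial_timeout  # assumed starting value, see the note above
    for _ in range(retries):
        elapsed += timeout
        times.append(elapsed)
        timeout *= 2           # exponential backoff
    return times


if __name__ == "__main__":
    print(synack_retransmit_times(5))
```

With retries=5 this gives 1, 3, 7, 15 and 31 seconds, close to the five retransmitted SYN/ACK packets in the trace.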

Since the TCP implementation on the client side gets multiple SYN/ACK packets, it will assume that the ACK packet was lost and resend it (see the lines with TCP Dup ACK in the above trace). If the application on the server side reduces the backlog (i.e. consumes an entry from the accept queue) before the maximum number of SYN/ACK retries has been reached, then the TCP implementation will eventually process one of the duplicate ACKs, transition the state of the connection from SYN RECEIVED to ESTABLISHED and add it to the accept queue. Otherwise, the client will eventually get a RST packet (as in the sample shown above).

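Because the client-side TCP implementation receives the SYN/ACK packet several times, each repeat makes it assume that its earlier ACK was lost, so it resends the ACK (see the TCP Dup ACK lines above).

If the server-side application reduces the current backlog before the SYN/ACK retries run out (this does not mean changing the argument passed to listen, but taking at least one entry off the accept queue), the server-side TCP implementation can process one of the duplicate ACKs, move the connection from SYN RECEIVED to ESTABLISHED, and add it to the accept queue.

Otherwise (if the accept queue stays full), the client eventually receives a RST packet that ends the connection, as in the example above.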

The packet trace also shows another interesting aspect of this behavior. From the point of view of the client, the connection will be in state ESTABLISHED after reception of the first SYN/ACK. If it sends data (without waiting for data from the server first), then that data will be retransmitted as well. Fortunately TCP slow-start should limit the number of segments sent during this phase.

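The trace shows another interesting fact: from the client's point of view, the connection is already in the ESTABLISHED state after the first SYN/ACK packet is received. If the client sends data at that point (rather than waiting for data from the server first), that data is retransmitted as well.

Fortunately, TCP slow start should limit the number of segments sent during this phase.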

On the other hand, if the client first waits for data from the server and the server never reduces the backlog, then the end result is that on the client side, the connection is in state ESTABLISHED, while on the server side, the connection is considered CLOSED. This means that we end up with a half-open connection!

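On the other hand, if the client first waits for data from the server and the server never gets the chance to reduce the current backlog, the end result is that the connection shows as ESTABLISHED on the client side while the server side already considers it CLOSED; in other words, we end up with a half-open connection.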

There is one other aspect that we didn’t discuss yet. The quote from the listen man page suggests that every SYN packet would result in the addition of a connection to the SYN queue (unless that queue is full). That is not exactly how things work. The reason is the following code in the tcp_v4_conn_request function (which does the processing of SYN packets) in net/ipv4/tcp_ipv4.c:

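There is one more aspect we have not discussed yet. The listen man page suggests that every SYN packet adds a connection to the SYN queue (unless the SYN queue itself is full), but the actual behavior does not quite match that description. The reason can be seen in the tcp_v4_conn_request function (which processes the SYN packets that open connections) in net/ipv4/tcp_ipv4.c: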

        /* Accept backlog is full. If we have already queued enough
         * of warm entries in syn queue, drop request. It is better than
         * clogging syn queue with openreqs with exponentially increasing
         * timeout.
         */
        if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
                NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
                goto drop;
        }
...
drop:
     return 0;
...

What this means is that if the accept queue is full, then the kernel will impose a limit on the rate at which SYN packets are accepted. If too many SYN packets are received, some of them will be dropped. In this case, it is up to the client to retry sending the SYN packet and we end up with the same behavior as in BSD-derived implementations.

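The code above means that if the accept queue is full, the kernel imposes a rate limit on the SYN packets it accepts: if too many SYN packets arrive, some of them are simply dropped. In that case it is up to the client to retransmit the SYN packet, and the behavior falls back to that of BSD-derived implementations.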
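The drop condition in the quoted kernel code boils down to a two-part test, restated below as a tiny predicate. The function and parameter names are hypothetical. "Young" is taken here to mean requests in the SYN queue whose SYN/ACK has not been retransmitted yet, which is an interpretation of the "warm entries" comment rather than something the article spells out.

```python
def should_drop_syn(accept_queue_full, young_requests):
    """Mirror of the quoted check: drop a new SYN only if the accept queue is
    full AND more than one young request is already waiting in the SYN queue."""
    return bool(accept_queue_full and young_requests > 1)


if __name__ == "__main__":
    for full, young in [(False, 5), (True, 1), (True, 2)]:
        print(f"full={full} young={young} -> drop={should_drop_syn(full, young)}")
```

So even with a full accept queue, a trickle of new SYNs still gets through; only a burst of fresh requests is cut off.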

To conclude, let’s try to see why the design choice made by Linux would be superior to the traditional BSD implementation. Stevens makes the following interesting point:

Finally, let us see why the design choice made by Linux would be superior to the traditional BSD implementation; Stevens himself makes the following point:

The backlog can be reached if the completed connection queue fills (i.e., the server process or the server host is so busy that the process cannot call accept fast enough to take the completed entries off the queue) or if the incomplete connection queue fills. The latter is the problem that HTTP servers face, when the round-trip time between the client and server is long, compared to the arrival rate of new connection requests, because a new SYN occupies an entry on this queue for one round-trip time. […]

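The backlog limit can be reached when either the completed connection queue or the incomplete connection queue fills up.

The completed connection queue can fill when the server process or the server host is so busy that the process cannot call accept fast enough to take the completed entries off the queue.

The incomplete connection queue filling up is what HTTP servers often face: it happens when the round-trip time (RTT) between client and server is long compared to the arrival rate of new connection requests (SYN packets), that is, when the SYN -> SYN/ACK -> ACK exchange takes too long.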

The completed connection queue is almost always empty because when an entry is placed on this queue, the server’s call to accept returns, and the server takes the completed connection off the queue.

The completed connection queue is almost always empty, because as soon as an entry is placed on it, the server's call to accept returns and the connection is taken off the queue.
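Stevens' remark that a new SYN occupies an entry for one round-trip time lends itself to a back-of-the-envelope estimate: by Little's law, the average number of entries on the incomplete connection queue is about the arrival rate multiplied by the round-trip time. The sketch below assumes a steady arrival rate, no packet loss, and exactly one RTT per entry; the function name and the numbers are illustrative only.

```python
def expected_incomplete_entries(arrival_rate_per_s, rtt_s):
    """Average SYN-queue occupancy if each SYN holds an entry for one RTT."""
    return arrival_rate_per_s * rtt_s


if __name__ == "__main__":
    # Illustrative numbers: 500 new connections per second, 250 ms round trip.
    print(expected_incomplete_entries(500, 0.25))
```

Under those assumptions 500 connections per second at a 250 ms round trip keep about 125 entries occupied, which is why a long RTT alone can exhaust a small queue.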

The solution suggested by Stevens is simply to increase the backlog. The problem with this is that it assumes that an application is expected to tune the backlog not only taking into account how it intends to process newly established incoming connections, but also as a function of traffic characteristics such as the round-trip time. The implementation in Linux effectively separates these two concerns: the application is only responsible for tuning the backlog such that it can call accept fast enough to avoid filling the accept queue; a system administrator can then tune /proc/sys/net/ipv4/tcp_max_syn_backlog based on traffic characteristics.

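The solution Stevens suggests is to increase the backlog.

The problem with this simple solution is that it assumes the application tunes the backlog with two things in mind: how it processes newly established incoming connections, and traffic characteristics such as the RTT.

The Linux implementation effectively separates these two concerns:

  • The application only needs to tune the backlog so that it can call accept fast enough to keep the accept queue from filling (the backlog is capped by the value of somaxconn, which this article does not cover).
  • A system administrator can tune /proc/sys/net/ipv4/tcp_max_syn_backlog based on an analysis of traffic characteristics (the effect of changing this parameter depends on the Linux kernel version, which this article does not cover).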
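The somaxconn cap mentioned in passing can be stated in one line: per the listen man page, a backlog larger than /proc/sys/net/core/somaxconn is silently reduced to that value. A minimal sketch, with a hypothetical function name and illustrative numbers:

```python
def effective_accept_backlog(requested_backlog, somaxconn):
    """Backlog the kernel actually uses once the somaxconn cap is applied."""
    return min(requested_backlog, somaxconn)


if __name__ == "__main__":
    # Illustrative values: the application asks for 1024, the system allows 128.
    print(effective_accept_backlog(1024, 128))
```

An application that asks for a large backlog therefore gains nothing unless the administrator also raises somaxconn.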

 
