Sending data with PACKET_MMAP and PACKET_TX_RING is slower than “normal” (without)

Question

I’m writing a traffic generator in C using the PACKET_MMAP socket option to create a ring buffer to send data over a raw socket. The ring buffer is filled with Ethernet frames to send and sendto is called. The entire contents of the ring buffer are sent over the socket, which should give higher performance than having a buffer in memory and calling sendto repeatedly for every frame in the buffer that needs sending.

When not using PACKET_MMAP, each call to sendto copies a single frame from the buffer in user-land memory to an sk_buff (skb) in kernel memory; the kernel must then copy the packet to memory accessible to the NIC for DMA and signal the NIC to DMA the frame into its own hardware buffers and queue it for transmission. When using the PACKET_MMAP socket option, mmapped memory is allocated by the application and linked to the raw socket. The application places packets into the mmapped buffer and calls sendto, and instead of the kernel having to copy the packets into an skb it can read them from the mmapped buffer directly. Also, "blocks" of packets can be read from the ring buffer instead of individual packets/frames. So the performance increase should come from one sys-call to copy multiple frames and one less copy action for each frame to get it into the NIC hardware buffers.

When I compare the performance of a socket using PACKET_MMAP to a “normal” socket (a char buffer with a single packet in it) there is no performance benefit at all. Why is this? When using PACKET_MMAP in Tx mode, only one frame can be put into each ring block (rather than multiple frames per ring block as with Rx mode); however, I am creating 256 blocks, so we should be sending 256 frames in a single sendto call, right?

Performance with PACKET_MMAP, main() calls packet_tx_mmap():

bensley@ubuntu-laptop:~/C/etherate10+$ sudo taskset -c 1 ./etherate_mt -I 1
Using inteface lo (1)
Running in Tx mode
1. Rx Gbps 0.00 (0) pps 0   Tx Gbps 17.65 (2206128128) pps 1457152
2. Rx Gbps 0.00 (0) pps 0   Tx Gbps 19.08 (2385579520) pps 1575680
3. Rx Gbps 0.00 (0) pps 0   Tx Gbps 19.28 (2409609728) pps 1591552
4. Rx Gbps 0.00 (0) pps 0   Tx Gbps 19.31 (2414260736) pps 1594624
5. Rx Gbps 0.00 (0) pps 0   Tx Gbps 19.30 (2411935232) pps 1593088

Performance without PACKET_MMAP, main() calls packet_tx():

bensley@ubuntu-laptop:~/C/etherate10+$ sudo taskset -c 1 ./etherate_mt -I 1
Using inteface lo (1)
Running in Tx mode
1. Rx Gbps 0.00 (0) pps 0   Tx Gbps 18.44 (2305001412) pps 1522458
2. Rx Gbps 0.00 (0) pps 0   Tx Gbps 20.30 (2537520018) pps 1676037
3. Rx Gbps 0.00 (0) pps 0   Tx Gbps 20.29 (2535744096) pps 1674864
4. Rx Gbps 0.00 (0) pps 0   Tx Gbps 20.26 (2533014354) pps 1673061
5. Rx Gbps 0.00 (0) pps 0   Tx Gbps 20.32 (2539476106) pps 1677329

The packet_tx() function seems slightly faster than the packet_tx_mmap() function, but it is also slightly shorter, so I think that minimal performance increase is simply down to the slightly fewer lines of code present in packet_tx(). So it looks to me like both functions have practically the same performance. Why is that? Why isn't PACKET_MMAP much faster? As I understand it there should be far fewer sys-calls and copies.

void *packet_tx_mmap(void* thd_opt_p) {

    struct thd_opt *thd_opt = thd_opt_p;
    int32_t sock_fd = setup_socket_mmap(thd_opt_p);
    if (sock_fd == EXIT_FAILURE) exit(EXIT_FAILURE);

    struct tpacket2_hdr *hdr;
    uint8_t *data;
    int32_t send_ret = 0;
    uint16_t i;

    while(1) {

        for (i = 0; i < thd_opt->tpacket_req.tp_frame_nr; i += 1) {

            hdr = (void*)(thd_opt->mmap_buf + (thd_opt->tpacket_req.tp_frame_size * i));
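            // Offset in bytes from the start of the frame slot (cast first, so the
            // arithmetic is not scaled by sizeof(struct tpacket2_hdr))
            data = (uint8_t*)hdr + TPACKET2_HDRLEN - sizeof(struct sockaddr_ll);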

            memcpy(data, thd_opt->tx_buffer, thd_opt->frame_size);
            hdr->tp_len = thd_opt->frame_size;
            hdr->tp_status = TP_STATUS_SEND_REQUEST;

        }

        send_ret = sendto(sock_fd, NULL, 0, 0, NULL, 0);
        if (send_ret == -1) {
            perror("sendto error");
            exit(EXIT_FAILURE);
        }

        thd_opt->tx_pkts  += thd_opt->tpacket_req.tp_frame_nr;
        thd_opt->tx_bytes += send_ret;

    }

    return NULL;

}

Note that the function below calls setup_socket() and not setup_socket_mmap():

void *packet_tx(void* thd_opt_p) {

    struct thd_opt *thd_opt = thd_opt_p;

    int32_t sock_fd = setup_socket(thd_opt_p); 

    if (sock_fd == EXIT_FAILURE) {
        printf("Can't create socket!\n");
        exit(EXIT_FAILURE);
    }

    while(1) {

        thd_opt->tx_bytes += sendto(sock_fd, thd_opt->tx_buffer,
                                    thd_opt->frame_size, 0,
                                    (struct sockaddr*)&thd_opt->bind_addr,
                                    sizeof(thd_opt->bind_addr));
        thd_opt->tx_pkts += 1;

    }

}

The only difference between the socket setup functions is pasted below; essentially it is what is required to set up a PACKET_RX_RING or PACKET_TX_RING:

// Set the TPACKET version, v2 for Tx and v3 for Rx
// (v2 supports packet level send(), v3 supports block level read())
int32_t sock_pkt_ver = -1;

if(thd_opt->sk_mode == SKT_TX) {
    static const int32_t sock_ver = TPACKET_V2;
    sock_pkt_ver = setsockopt(sock_fd, SOL_PACKET, PACKET_VERSION, &sock_ver, sizeof(sock_ver));
} else {
    static const int32_t sock_ver = TPACKET_V3;
    sock_pkt_ver = setsockopt(sock_fd, SOL_PACKET, PACKET_VERSION, &sock_ver, sizeof(sock_ver));
}

if (sock_pkt_ver < 0) {
    perror("Can't set socket packet version");
    return EXIT_FAILURE;
}


memset(&thd_opt->tpacket_req, 0, sizeof(struct tpacket_req));
memset(&thd_opt->tpacket_req3, 0, sizeof(struct tpacket_req3));

//thd_opt->block_sz = 4096; // These are set elsewhere
//thd_opt->block_nr = 256;
//thd_opt->block_frame_sz = 4096;

int32_t sock_mmap_ring = -1;
if (thd_opt->sk_mode == SKT_TX) {

    thd_opt->tpacket_req.tp_block_size = thd_opt->block_sz;
    thd_opt->tpacket_req.tp_frame_size = thd_opt->block_sz;
    thd_opt->tpacket_req.tp_block_nr = thd_opt->block_nr;
    // Allocate per-frame blocks in Tx mode (TPACKET_V2)
    thd_opt->tpacket_req.tp_frame_nr = thd_opt->block_nr;

    sock_mmap_ring = setsockopt(sock_fd, SOL_PACKET , PACKET_TX_RING , (void*)&thd_opt->tpacket_req , sizeof(struct tpacket_req));

} else {

    thd_opt->tpacket_req3.tp_block_size = thd_opt->block_sz;
    thd_opt->tpacket_req3.tp_frame_size = thd_opt->block_frame_sz;
    thd_opt->tpacket_req3.tp_block_nr = thd_opt->block_nr;
    thd_opt->tpacket_req3.tp_frame_nr = (thd_opt->block_sz * thd_opt->block_nr) / thd_opt->block_frame_sz;
    thd_opt->tpacket_req3.tp_retire_blk_tov   = 1;
    thd_opt->tpacket_req3.tp_feature_req_word = 0;

    sock_mmap_ring = setsockopt(sock_fd, SOL_PACKET , PACKET_RX_RING , (void*)&thd_opt->tpacket_req3 , sizeof(thd_opt->tpacket_req3));
}

if (sock_mmap_ring == -1) {
    perror("Can't enable Tx/Rx ring for socket");
    return EXIT_FAILURE;
}


thd_opt->mmap_buf = NULL;
thd_opt->rd = NULL;

if (thd_opt->sk_mode == SKT_TX) {

    thd_opt->mmap_buf = mmap(NULL, (thd_opt->block_sz * thd_opt->block_nr), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sock_fd, 0);

    if (thd_opt->mmap_buf == MAP_FAILED) {
        perror("mmap failed");
        return EXIT_FAILURE;
    }


} else {

    thd_opt->mmap_buf = mmap(NULL, (thd_opt->block_sz * thd_opt->block_nr), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sock_fd, 0);

    if (thd_opt->mmap_buf == MAP_FAILED) {
        perror("mmap failed");
        return EXIT_FAILURE;
    }

    // Per-block rings in Rx mode (TPACKET_V3)
    thd_opt->rd = (struct iovec*)calloc(thd_opt->tpacket_req3.tp_block_nr, sizeof(struct iovec));

    for (uint16_t i = 0; i < thd_opt->tpacket_req3.tp_block_nr; ++i) {
        thd_opt->rd[i].iov_base = thd_opt->mmap_buf + (i * thd_opt->tpacket_req3.tp_block_size);
        thd_opt->rd[i].iov_len  = thd_opt->tpacket_req3.tp_block_size;
    }


}
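
For reference, the Tx ring geometry implied by the setup code above can be sketched as follows. This is a minimal sketch, not part of the application: struct tx_ring_geom and the ring_* helpers are hypothetical names, and the sizes used in the note underneath are taken from the commented-out defaults (the real values are set elsewhere).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type; mirrors the tpacket_req fields used above. */
struct tx_ring_geom {
    size_t block_size; /* tp_block_size */
    size_t block_nr;   /* tp_block_nr   */
    size_t frame_size; /* tp_frame_size */
};

/* Total number of frame slots in the ring (tp_frame_nr). */
static size_t ring_frame_nr(const struct tx_ring_geom *g)
{
    assert(g->frame_size > 0 && g->block_size % g->frame_size == 0);
    return (g->block_size / g->frame_size) * g->block_nr;
}

/* Size in bytes of the region passed to mmap(). */
static size_t ring_bytes(const struct tx_ring_geom *g)
{
    return g->block_size * g->block_nr;
}

/* Byte offset of frame slot i from the start of the mapping. */
static size_t ring_frame_offset(const struct tx_ring_geom *g, size_t i)
{
    size_t per_block = g->block_size / g->frame_size;
    assert(i < ring_frame_nr(g));
    return (i / per_block) * g->block_size + (i % per_block) * g->frame_size;
}
```

With block_size = 4096, block_nr = 256 and frame_size = 4096 this gives 256 frame slots in a 1 MiB mapping, with slot i at byte offset i * 4096, which is the same address calculation packet_tx_mmap() uses for hdr.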

Update 1: Result against physical interface(s)

It was mentioned that one reason I might not be seeing a performance difference when using PACKET_MMAP was that I was sending traffic to the loopback interface (which, for one thing, doesn't have a QDISC). Since either of the packet_tx_mmap() or packet_tx() routines can generate more than 10Gbps and I only have 10Gbps interfaces at my disposal, I have bonded two together. These are the results, which show pretty much the same as above: there is minimal speed difference between the two functions.

packet_tx() to 20G bond0:

1 thread: Average 10.77Gbps~ / 889kfps~
2 threads: Average 19.19Gbps~ / 1.58Mfps~
3 threads: Average 19.67Gbps~ / 1.62Mfps~ (this is as fast as the bond will go)

packet_tx_mmap() to 20G bond0:

1 thread: Average 11.08Gbps~ / 913kfps~
2 threads: Average 19.0Gbps~ / 1.57Mfps~
3 threads: Average 19.66Gbps~ / 1.62Mfps~ (this is as fast as the bond will go)

This was with frames 1514 bytes in size (to keep it the same as the original loopback tests above).

In all of the above tests the number of soft IRQs was roughly the same (measured using this script). With one thread running packet_tx() there were circa 40k interrupts per second on a CPU core. With 2 and 3 threads running there were 40k on each of 2 and 3 cores respectively. The results when using packet_tx_mmap() were the same: circa 40k soft IRQs for a single thread on one CPU core, and 40k per core when running 2 and 3 threads.

Update 2: Full Source Code

I have uploaded the full source code now, I'm still writing this application so it probably has plenty of flaws but they are outside the scope of this question: https://github.com/jwbensley/EtherateMT

Answer 1:

Many interfaces to the Linux kernel are not well documented. Even where they seem well documented, they can be pretty complex, and that can make it hard to understand what the functional or, often even harder, nonfunctional properties of the interface are.

For this reason, my advice is that anyone wanting a solid understanding of kernel APIs, or needing to create high performance applications using kernel APIs, needs to be able to engage with kernel code to be successful.

In this case the questioner wants to understand the performance characteristics of sending raw frames through a shared memory interface (packet mmap) to the kernel.

The Linux documentation is here. It has a stale link to a "how to," which can now be found here and includes a copy of packet_mmap.c (I have a slightly different version available here).

The documentation is largely geared towards reading, which is the typical use case for using packet mmap: efficiently reading raw frames from an interface for, e.g. efficiently obtaining a packet capture from a high speed interface with little or no loss.

The OP however is interested in high performance writing, which is a much less common use case, but potentially useful for a traffic generator/simulator which appears to be what the OP wants to do with it. Thankfully, the "how to" is all about writing frames.

Even so, there is very little information provided about how this actually works, and nothing of obvious help to answer the OP's question about why using packet mmap doesn't seem to be faster than not using it and instead sending one frame at a time.

Thankfully the kernel source is open source and well indexed, so we can turn to the source to help us get the answer to the question.

In order to find the relevant kernel code there are several keywords you could search for, but PACKET_TX_RING stands out as a socket option unique to this feature. Searching on the interwebs for "PACKET_TX_RING linux cross reference" turns up a small number of references, including af_packet.c, which with a little inspection appears to be the implementation of all the AF_PACKET functionality, including packet mmap.

Looking through af_packet.c, it appears that the core of the work for transmitting with packet mmap takes place in tpacket_snd(). But is this correct? How can we tell if this has anything to do with what we think it does?

A very powerful tool for getting information like this out of the kernel is SystemTap. (Using this requires installing debugging symbols for your kernel. I happen to be using Ubuntu, and this is a recipe for getting SystemTap working on Ubuntu.)

Once you have SystemTap working, you can use it in conjunction with packet_mmap.c to see if tpacket_snd() is even invoked, by installing a probe on the kernel function tpacket_snd and then running packet_mmap to send a frame via a shared TX ring:

$ sudo stap -e 'probe kernel.function("tpacket_snd") { printf("W00T!\n"); }' &
[1] 19961
$ sudo ./packet_mmap -c 1 eth0
[...]
STARTING TEST:
data offset = 32 bytes
start fill() thread
send 1 packets (+150 bytes)
end of task fill()
Loop until queue empty (0)
END (number of error:0)
W00T!
W00T!

W00T! We are on to something; tpacket_snd is actually being called. But our victory will be short lived. If we continue to try to get more information out of a stock kernel build, SystemTap will complain that it can't find the variables we want to inspect and function arguments will print out with values as ? or ERROR. This is because the kernel is compiled with optimization and all of the functionality for AF_PACKET is defined in the single translation unit af_packet.c; many of the functions are inlined by the compiler, effectively losing local variables and arguments.

In order to pry more information out of af_packet.c, we are going to have to build a version of the kernel where af_packet.c is built without optimization. Look here for some guidance. I'll wait.

OK, hopefully that wasn't too hard and you have successfully booted a kernel that SystemTap can get lots of good information from. Keep in mind that this kernel version is just to help us figure out how packet mmap is working. We can't get any direct performance information from this kernel because af_packet.c was built without optimization. If it turns out that we need to get information on how the optimized version would behave, we can build another kernel with af_packet.c compiled with optimization, but with some instrumentation code added that exposes information via variables that won't get optimized out so that SystemTap can see them.

So let's use it to get some information. Take a look at status.stp:

# This is specific to net/packet/af_packet.c 3.13.0-116

function print_ts() {
  ts = gettimeofday_us();
  printf("[%10d.%06d] ", ts/1000000, ts%1000000);
}

#  325 static void __packet_set_status(struct packet_sock *po, void *frame, int status)
#  326 {
#  327  union tpacket_uhdr h;
#  328 
#  329  h.raw = frame;
#  330  switch (po->tp_version) {
#  331  case TPACKET_V1:
#  332      h.h1->tp_status = status;
#  333      flush_dcache_page(pgv_to_page(&h.h1->tp_status));
#  334      break;
#  335  case TPACKET_V2:
#  336      h.h2->tp_status = status;
#  337      flush_dcache_page(pgv_to_page(&h.h2->tp_status));
#  338      break;
#  339  case TPACKET_V3:
#  340  default:
#  341      WARN(1, "TPACKET version not supported.\n");
#  342      BUG();
#  343  }
#  344 
#  345  smp_wmb();
#  346 }

probe kernel.statement("__packet_set_status@net/packet/af_packet.c:334") {
  print_ts();
  printf("SET(V1): %d (0x%.16x)\n", $status, $frame);
}

probe kernel.statement("__packet_set_status@net/packet/af_packet.c:338") {
  print_ts();
  printf("SET(V2): %d\n", $status);
}

#  348 static int __packet_get_status(struct packet_sock *po, void *frame)
#  349 {
#  350  union tpacket_uhdr h;
#  351 
#  352  smp_rmb();
#  353 
#  354  h.raw = frame;
#  355  switch (po->tp_version) {
#  356  case TPACKET_V1:
#  357      flush_dcache_page(pgv_to_page(&h.h1->tp_status));
#  358      return h.h1->tp_status;
#  359  case TPACKET_V2:
#  360      flush_dcache_page(pgv_to_page(&h.h2->tp_status));
#  361      return h.h2->tp_status;
#  362  case TPACKET_V3:
#  363  default:
#  364      WARN(1, "TPACKET version not supported.\n");
#  365      BUG();
#  366      return 0;
#  367  }
#  368 }

probe kernel.statement("__packet_get_status@net/packet/af_packet.c:358") { 
  print_ts();
  printf("GET(V1): %d (0x%.16x)\n", $h->h1->tp_status, $frame); 
}

probe kernel.statement("__packet_get_status@net/packet/af_packet.c:361") { 
  print_ts();
  printf("GET(V2): %d\n", $h->h2->tp_status); 
}

# 2088 static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
# 2089 {
# [...]
# 2136  do {
# 2137      ph = packet_current_frame(po, &po->tx_ring,
# 2138              TP_STATUS_SEND_REQUEST);
# 2139 
# 2140      if (unlikely(ph == NULL)) {
# 2141          schedule();
# 2142          continue;
# 2143      }
# 2144 
# 2145      status = TP_STATUS_SEND_REQUEST;
# 2146      hlen = LL_RESERVED_SPACE(dev);
# 2147      tlen = dev->needed_tailroom;
# 2148      skb = sock_alloc_send_skb(&po->sk,
# 2149              hlen + tlen + sizeof(struct sockaddr_ll),
# 2150              0, &err);
# 2151 
# 2152      if (unlikely(skb == NULL))
# 2153          goto out_status;
# 2154 
# 2155      tp_len = tpacket_fill_skb(po, skb, ph, dev, size_max, proto,
# 2156                    addr, hlen);
# [...]
# 2176      skb->destructor = tpacket_destruct_skb;
# 2177      __packet_set_status(po, ph, TP_STATUS_SENDING);
# 2178      atomic_inc(&po->tx_ring.pending);
# 2179 
# 2180      status = TP_STATUS_SEND_REQUEST;
# 2181      err = dev_queue_xmit(skb);
# 2182      if (unlikely(err > 0)) {
# [...]
# 2195      }
# 2196      packet_increment_head(&po->tx_ring);
# 2197      len_sum += tp_len;
# 2198  } while (likely((ph != NULL) ||
# 2199          ((!(msg->msg_flags & MSG_DONTWAIT)) &&
# 2200           (atomic_read(&po->tx_ring.pending))))
# 2201      );
# 2202 
# [...]
# 2213  return err;
# 2214 }

probe kernel.function("tpacket_snd") {
  print_ts();
  printf("tpacket_snd: args(%s)\n", $$parms);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2140") {
  print_ts();
  printf("tpacket_snd:2140: current frame ph = 0x%.16x\n", $ph);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2141") {
  print_ts();
  printf("tpacket_snd:2141: (ph==NULL) --> schedule()\n");
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2142") {
  print_ts();
  printf("tpacket_snd:2142: flags 0x%x, pending %d\n", 
     $msg->msg_flags, $po->tx_ring->pending->counter);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2197") {
  print_ts();
  printf("tpacket_snd:2197: flags 0x%x, pending %d\n", 
     $msg->msg_flags, $po->tx_ring->pending->counter);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2213") {
  print_ts();
  printf("tpacket_snd: return(%d)\n", $err);
}

# 1946 static void tpacket_destruct_skb(struct sk_buff *skb)
# 1947 {
# 1948  struct packet_sock *po = pkt_sk(skb->sk);
# 1949  void *ph;
# 1950 
# 1951  if (likely(po->tx_ring.pg_vec)) {
# 1952      __u32 ts;
# 1953 
# 1954      ph = skb_shinfo(skb)->destructor_arg;
# 1955      BUG_ON(atomic_read(&po->tx_ring.pending) == 0);
# 1956      atomic_dec(&po->tx_ring.pending);
# 1957 
# 1958      ts = __packet_set_timestamp(po, ph, skb);
# 1959      __packet_set_status(po, ph, TP_STATUS_AVAILABLE | ts);
# 1960  }
# 1961 
# 1962  sock_wfree(skb);
# 1963 }

probe kernel.statement("tpacket_destruct_skb@net/packet/af_packet.c:1959") {
  print_ts();
  printf("tpacket_destruct_skb:1959: ph = 0x%.16x, ts = 0x%x, pending %d\n",
     $ph, $ts, $po->tx_ring->pending->counter);
}

This defines a function (print_ts to print out unix epoch time with microsecond resolution) and a number of probes.

First we define probes to print out information when packets in the tx_ring have their status set or read. Next we define probes for the call and return of tpacket_snd and at points within the do {...} while (...) loop processing the packets in the tx_ring. Finally we add a probe to the skb destructor.

We can start the SystemTap script with sudo stap status.stp. Then run sudo packet_mmap -c 2 <interface> to send 2 frames through the interface. Here is the output I got from the SystemTap script:

[1492581245.839850] tpacket_snd: args(po=0xffff88016720ee38 msg=0x14)
[1492581245.839865] GET(V1): 1 (0xffff880241202000)
[1492581245.839873] tpacket_snd:2140: current frame ph = 0xffff880241202000
[1492581245.839887] SET(V1): 2 (0xffff880241202000)
[1492581245.839918] tpacket_snd:2197: flags 0x40, pending 1
[1492581245.839923] GET(V1): 1 (0xffff88013499c000)
[1492581245.839929] tpacket_snd:2140: current frame ph = 0xffff88013499c000
[1492581245.839935] SET(V1): 2 (0xffff88013499c000)
[1492581245.839946] tpacket_snd:2197: flags 0x40, pending 2
[1492581245.839951] GET(V1): 0 (0xffff88013499e000)
[1492581245.839957] tpacket_snd:2140: current frame ph = 0x0000000000000000
[1492581245.839961] tpacket_snd:2141: (ph==NULL) --> schedule()
[1492581245.839977] tpacket_snd:2142: flags 0x40, pending 2
[1492581245.839984] tpacket_snd: return(300)
[1492581245.840077] tpacket_snd: args(po=0x0 msg=0x14)
[1492581245.840089] GET(V1): 0 (0xffff88013499e000)
[1492581245.840098] tpacket_snd:2140: current frame ph = 0x0000000000000000
[1492581245.840093] tpacket_destruct_skb:1959: ph = 0xffff880241202000, ts = 0x0, pending 1
[1492581245.840102] tpacket_snd:2141: (ph==NULL) --> schedule()
[1492581245.840104] SET(V1): 0 (0xffff880241202000)
[1492581245.840112] tpacket_snd:2142: flags 0x40, pending 1
[1492581245.840116] tpacket_destruct_skb:1959: ph = 0xffff88013499c000, ts = 0x0, pending 0
[1492581245.840119] tpacket_snd: return(0)
[1492581245.840123] SET(V1): 0 (0xffff88013499c000)

And here is the network capture:

There is a lot of useful information in the SystemTap output. We can see tpacket_snd get the status of the first frame in the ring (TP_STATUS_SEND_REQUEST is 1) and then set it to TP_STATUS_SENDING (2). It does the same with the second. The next frame has status TP_STATUS_AVAILABLE (0), which is not a send request, so it calls schedule() to yield, and continues the loop. Since there are no more frames to send (ph == NULL) and non-blocking has been requested (msg->msg_flags == MSG_DONTWAIT, the 0x40 in the trace), the do {...} while (...) loop terminates, and tpacket_snd returns 300, the number of bytes queued for transmission.

Next, packet_mmap calls sendto again (via the "loop until queue empty" code), but there is no more data to send in the tx ring, and non-blocking is requested, so it immediately returns 0, as no data has been queued. Note that the frame it checked the status of is the same frame it checked last in the previous call --- it did not start with the first frame in the tx ring, it checked the head (which is not available in userland).

Asynchronously, the destructor is called, first on the first frame, setting the status of the frame to TP_STATUS_AVAILABLE and decrementing the pending count, and then on the second frame. Note that if non-blocking was not requested, the test at the end of the do {...} while (...) loop will wait until all of the pending packets have been transferred to the NIC (assuming it supports scattered data) before returning. You can watch this by running packet_mmap with the -t option for "threaded" which uses blocking I/O (until it gets to "loop until queue empty").
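
The behaviour seen in the trace can be summarised with a small user-space model. This is only a sketch of the observed status transitions, not kernel code: struct tx_ring_model, model_snd() and model_destruct() are hypothetical names, the ring is shrunk to four slots, and the blocking path and schedule() are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Status values as defined in <linux/if_packet.h>. */
enum { ST_AVAILABLE = 0, ST_SEND_REQUEST = 1, ST_SENDING = 2 };

#define RING_FRAMES 4

struct tx_ring_model {
    int    status[RING_FRAMES]; /* tp_status of each frame slot               */
    size_t len[RING_FRAMES];    /* tp_len of each frame slot                  */
    size_t head;                /* kernel-private index, kept across calls    */
    int    pending;             /* frames handed onward but not yet destructed */
};

/* Model of one non-blocking tpacket_snd() call: returns bytes queued. */
static size_t model_snd(struct tx_ring_model *r)
{
    size_t queued = 0;

    /* Start at the saved head, not at slot 0, and stop at the first
     * slot that is not marked as a send request. */
    while (r->status[r->head] == ST_SEND_REQUEST) {
        r->status[r->head] = ST_SENDING;
        r->pending++;
        queued += r->len[r->head];
        r->head = (r->head + 1) % RING_FRAMES;
    }
    return queued;
}

/* Model of tpacket_destruct_skb(): the slot becomes reusable. */
static void model_destruct(struct tx_ring_model *r, size_t slot)
{
    assert(r->pending > 0);
    r->pending--;
    r->status[slot] = ST_AVAILABLE;
}
```

Filling two slots with 150-byte frames and calling model_snd() twice reproduces the shape of the trace: the first call returns 300 and leaves the head at slot 2 with two frames pending, the second call returns 0 because slot 2 is still available, and the two model_destruct() calls bring pending back to 0.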

A couple of things to note. First, the timestamps on the SystemTap output are not strictly increasing: it is not safe to infer temporal ordering from SystemTap output. Second, note that the timestamps on the network capture (done locally) are different. FWIW, the interface is a cheap 1G in a cheap tower computer.

So at this point, I think we more or less know how af_packet is processing the shared tx ring. What comes next is how the frames in the tx ring find their way to the network interface. It might be helpful to review this section (on how layer 2 transmission is handled) of an overview of the control flow in the Linux networking kernel.

OK, so if you have a basic understanding of how layer 2 transmission is handled, it would seem like this packet mmap interface should be an enormous fire hose; load up a shared tx ring with packets, call sendto() with MSG_DONTWAIT, and then tpacket_snd will iterate through the tx queue creating skb's and enqueueing them onto the qdisc. Asynchronously, skb's will be dequeued from the qdisc and sent to the hardware tx ring. The skb's should be non-linear so they will reference the data in the tx ring rather than copy, and a nice modern NIC should be able to handle scattered data and reference the data in the tx rings as well. Of course, any of these assumptions could be wrong, so let's try to dump a whole lot of hurt on a qdisc with this fire hose.

But first, a not commonly understood fact about how qdiscs work. They hold a bounded amount of data (generally counted in number of frames, but in some cases it could be measured in bytes) and if you try to enqueue a frame to a full qdisc, the frame will generally be dropped (depending on what the enqueuer decides to do). So I will give out the hint that my original hypothesis was that the OP was using packet mmap to blast frames into a qdisc so fast that many were being dropped. But don't hold too fast to that idea; it takes you in a direction, but always keep an open mind. Let's give it a try to find out what happens.
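
To make that hypothesis concrete, here is a toy model of a bounded, tail-drop FIFO. It is a sketch only: struct fifo_model and the fifo_* functions are hypothetical names, the limit is counted in frames, and the burst helper assumes the worst case where nothing is dequeued while the burst arrives.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a frame-counted FIFO qdisc with tail drop. */
struct fifo_model {
    size_t limit;   /* maximum number of queued frames */
    size_t backlog; /* frames currently queued         */
    size_t dropped; /* frames rejected because full    */
};

/* Try to enqueue one frame: returns 1 if accepted, 0 if dropped. */
static int fifo_enqueue(struct fifo_model *q)
{
    assert(q->limit > 0);
    if (q->backlog >= q->limit) {
        q->dropped++;
        return 0;
    }
    q->backlog++;
    return 1;
}

/* Hand one frame to the driver: returns 1 if a frame was removed. */
static int fifo_dequeue(struct fifo_model *q)
{
    if (q->backlog == 0)
        return 0;
    q->backlog--;
    return 1;
}

/* Offer a burst with no dequeue in between; returns frames dropped. */
static size_t fifo_burst(struct fifo_model *q, size_t frames)
{
    size_t before = q->dropped;
    for (size_t i = 0; i < frames; i++)
        (void)fifo_enqueue(q);
    return q->dropped - before;
}
```

Under this model a burst of 10 frames into an empty queue with a limit of 4 drops 6 of them, so a sender that really outpaced the dequeue side would be expected to show up as a large drop count in the qdisc statistics.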

The first problem in trying this out is that the default qdisc pfifo_fast doesn't keep statistics. So let's replace that with the qdisc pfifo, which does. By default pfifo limits the queue to TXQUEUELEN frames (which generally defaults to 1000). But since we want to demonstrate overwhelming a qdisc, let's explicitly set it to 50:

$ sudo tc qdisc add dev eth0 root pfifo limit 50
$ tc -s -d qdisc show dev eth0
qdisc pfifo 8004: root refcnt 2 limit 50p
 Sent 42 bytes 1 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0

Let's also measure how long it takes to process the frames in tpacket_snd with the SystemTap script call-return.stp:

# This is specific to net/packet/af_packet.c 3.13.0-116

function print_ts() {
  ts = gettimeofday_us();
  printf("[%10d.%06d] ", ts/1000000, ts%1000000);
}

# 2088 static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
# 2089 {
# [...]
# 2213  return err;
# 2214 }

probe kernel.function("tpacket_snd") {
  print_ts();
  printf("tpacket_snd: args(%s)\n", $$parms);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2213") {
  print_ts();
  printf("tpacket_snd: return(%d)\n", $err);
}

Start the SystemTap script with sudo stap call-return.stp and then let's blast 8096 1500 byte frames into that qdisc with a meager 50 frame capacity:

$ sudo ./packet_mmap -c 8096 -s 1500 eth0
[...]
STARTING TEST:
data offset = 32 bytes
start fill() thread
send 8096 packets (+12144000 bytes)
end of task fill()
Loop until queue empty (0)
END (number of error:0)

So let's check how many packets were dropped by the qdisc:

$ tc -s -d qdisc show dev eth0
qdisc pfifo 8004: root refcnt 2 limit 50p
 Sent 25755333 bytes 8606 pkt (dropped 1, overlimits 0 requeues 265) 
 backlog 0b 0p requeues 265

WAT? Dropped one of 8096 frames dumped onto a 50 frame qdisc? Let's check the SystemTap output:

[1492603552.938414] tpacket_snd: args(po=0xffff8801673ba338 msg=0x14)
[1492603553.036601] tpacket_snd: return(12144000)
[1492603553.036706] tpacket_snd: args(po=0x0 msg=0x14)
[1492603553.036716] tpacket_snd: return(0)

WAT? It took nearly 100ms to process 8096 frames in tpacket_snd? Let's check how long that would actually take to transmit; that's 8096 frames at 1500 bytes/frame at 1 gigabit/s ~= 97ms. WAT? It smells like something is blocking.
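
The back-of-the-envelope figure can be checked with a few lines of C. This is a sketch: wire_time_ms() is a hypothetical helper, and it counts only the frame payload bytes given, ignoring preamble, inter-frame gap and any queueing.

```c
#include <assert.h>

/* Time in milliseconds to serialize `frames` frames of `bytes_per_frame`
 * bytes each onto a link running at `link_bits_per_sec`. */
static double wire_time_ms(unsigned long frames,
                           unsigned long bytes_per_frame,
                           double link_bits_per_sec)
{
    assert(link_bits_per_sec > 0.0);
    double bits = (double)frames * (double)bytes_per_frame * 8.0;
    return bits / link_bits_per_sec * 1000.0;
}
```

wire_time_ms(8096, 1500, 1e9) comes out at about 97.15 ms, against roughly 98.2 ms between the call and the return of tpacket_snd in the trace. The call appears to last about as long as it takes to put the whole ring on the wire.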

Let's take a closer look at tpacket_snd. Groan:

skb = sock_alloc_send_skb(&po->sk,
                 hlen + tlen + sizeof(struct sockaddr_ll),
                 0, &err);

That 0 looks pretty innocuous, but that is actually the noblock argument. It should be msg->msg_flags & MSG_DONTWAIT (it turns out this is fixed in 4.1). What is happening here is that the size of the qdisc is not the only limiting resource. If allocating space for the skb would exceed the size of the socket's sndbuf limit, then this call will either block to wait for skb's to be freed up or return -EAGAIN to a non-blocking caller. In the fix in V4.1, if non-blocking is requested it will return the number of bytes written if non-zero, otherwise -EAGAIN to the caller, which almost seems like someone doesn't want you to figure out how to use this (e.g. you fill up a tx ring with 80MB of data, call sendto with MSG_DONTWAIT, and you get back a result that you sent 150KB rather than EWOULDBLOCK).

So if you are running a kernel prior to 4.1 (I believe the OP is running >4.1 and is not affected by this bug), you will need to patch af_packet.c and build a new kernel or upgrade to a kernel 4.1 or better.

I have now booted a patched version of my kernel, since the machine I am using is running 3.13. While we won't block if the sndbuf is full, we still will return with -EAGAIN. I made some changes to packet_mmap.c to increase the default size of the sndbuf and to use SO_SNDBUFFORCE to override the system max per socket if necessary (it appears to need about 750 bytes + the frame size for each frame). I also made some additions to call-return.stp to log the sndbuf max size (sk_sndbuf), the amount used (sk_wmem_alloc), any error returned by sock_alloc_send_skb and any error returned from dev_queue_xmit on enqueuing the skb to the qdisc. Here is the new version:

# This is specific to net/packet/af_packet.c 3.13.0-116

function print_ts() {
  ts = gettimeofday_us();
  printf("[%10d.%06d] ", ts/1000000, ts%1000000);
}

# 2088 static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
# 2089 {
# [...]
# 2133  if (size_max > dev->mtu + reserve + VLAN_HLEN)
# 2134      size_max = dev->mtu + reserve + VLAN_HLEN;
# 2135 
# 2136  do {
# [...]
# 2148      skb = sock_alloc_send_skb(&po->sk,
# 2149              hlen + tlen + sizeof(struct sockaddr_ll),
# 2150              msg->msg_flags & MSG_DONTWAIT, &err);
# 2151 
# 2152      if (unlikely(skb == NULL))
# 2153          goto out_status;
# [...]
# 2181      err = dev_queue_xmit(skb);
# 2182      if (unlikely(err > 0)) {
# 2183          err = net_xmit_errno(err);
# 2184          if (err && __packet_get_status(po, ph) ==
# 2185                 TP_STATUS_AVAILABLE) {
# 2186              /* skb was destructed already */
# 2187              skb = NULL;
# 2188              goto out_status;
# 2189          }
# 2190          /*
# 2191           * skb was dropped but not destructed yet;
# 2192           * let's treat it like congestion or err < 0
# 2193           */
# 2194          err = 0;
# 2195      }
# 2196      packet_increment_head(&po->tx_ring);
# 2197      len_sum += tp_len;
# 2198  } while (likely((ph != NULL) ||
# 2199          ((!(msg->msg_flags & MSG_DONTWAIT)) &&
# 2200           (atomic_read(&po->tx_ring.pending))))
# 2201      );
# [...]
# 2213  return err;
# 2214 }

probe kernel.function("tpacket_snd") {
  print_ts();
  printf("tpacket_snd: args(%s)\n", $$parms);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2133") {
  print_ts();
  printf("tpacket_snd:2133: sk_sndbuf =  %d sk_wmem_alloc = %d\n", 
     $po->sk->sk_sndbuf, $po->sk->sk_wmem_alloc->counter);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2153") {
  print_ts();
  printf("tpacket_snd:2153: sock_alloc_send_skb err = %d, sk_sndbuf =  %d sk_wmem_alloc = %d\n", 
     $err, $po->sk->sk_sndbuf, $po->sk->sk_wmem_alloc->counter);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2182") {
  if ($err != 0) {
    print_ts();
    printf("tpacket_snd:2182: dev_queue_xmit err = %d\n", $err);
  }
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2187") {
  print_ts();
  printf("tpacket_snd:2187: destructed: net_xmit_errno = %d\n", $err);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2194") {
  print_ts();
  printf("tpacket_snd:2194: *NOT* destructed: net_xmit_errno = %d\n", $err);
}

probe kernel.statement("tpacket_snd@net/packet/af_packet.c:2213") {
  print_ts();
  printf("tpacket_snd: return(%d) sk_sndbuf =  %d sk_wmem_alloc = %d\n", 
     $err, $po->sk->sk_sndbuf, $po->sk->sk_wmem_alloc->counter);
}

Let's try again:

$ sudo tc qdisc add dev eth0 root pfifo limit 50
$ tc -s -d qdisc show dev eth0
qdisc pfifo 8001: root refcnt 2 limit 50p
 Sent 2154 bytes 21 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
$ sudo ./packet_mmap -c 200 -s 1500 eth0
[...]
c_sndbuf_sz:       1228800
[...]
STARTING TEST:
data offset = 32 bytes
send buff size = 1228800
got buff size = 425984
buff size smaller than desired, trying to force...
got buff size = 2457600
start fill() thread
send: No buffer space available
end of task fill()
send: No buffer space available
Loop until queue empty (-1)
[repeated another 17 times]
send 3 packets (+4500 bytes)
Loop until queue empty (4500)
Loop until queue empty (0)
END (number of error:0)
$  tc -s -d qdisc show dev eth0
qdisc pfifo 8001: root refcnt 2 limit 50p
 Sent 452850 bytes 335 pkt (dropped 19, overlimits 0 requeues 3) 
 backlog 0b 0p requeues 3

And here is the SystemTap output:

[1492759330.907151] tpacket_snd: args(po=0xffff880393246c38 msg=0x14)
[1492759330.907162] tpacket_snd:2133: sk_sndbuf =  2457600 sk_wmem_alloc = 1
[1492759330.907491] tpacket_snd:2182: dev_queue_xmit err = 1
[1492759330.907494] tpacket_snd:2187: destructed: net_xmit_errno = -105
[1492759330.907500] tpacket_snd: return(-105) sk_sndbuf =  2457600 sk_wmem_alloc = 218639
[1492759330.907646] tpacket_snd: args(po=0x0 msg=0x14)
[1492759330.907653] tpacket_snd:2133: sk_sndbuf =  2457600 sk_wmem_alloc = 189337
[1492759330.907688] tpacket_snd:2182: dev_queue_xmit err = 1
[1492759330.907691] tpacket_snd:2187: destructed: net_xmit_errno = -105
[1492759330.907694] tpacket_snd: return(-105) sk_sndbuf =  2457600 sk_wmem_alloc = 189337
[repeated 17 times]
[1492759330.908541] tpacket_snd: args(po=0x0 msg=0x14)
[1492759330.908543] tpacket_snd:2133: sk_sndbuf =  2457600 sk_wmem_alloc = 189337
[1492759330.908554] tpacket_snd: return(4500) sk_sndbuf =  2457600 sk_wmem_alloc = 196099
[1492759330.908570] tpacket_snd: args(po=0x0 msg=0x14)
[1492759330.908572] tpacket_snd:2133: sk_sndbuf =  2457600 sk_wmem_alloc = 196099
[1492759330.908576] tpacket_snd: return(0) sk_sndbuf =  2457600 sk_wmem_alloc = 196099

Now things are working as expected. We have fixed a bug that caused us to block if the sndbuf limit is exceeded, and we have raised the sndbuf limit so that it should not be a constraint. We now see the frames from the tx ring being enqueued onto the qdisc until it is full, at which point sendto returns ENOBUFS.

The next problem is how to efficiently keep publishing to the qdisc so that the interface stays busy. Note that the implementation of packet_poll is useless in the case where we fill up the qdisc and get back ENOBUFS, because it only checks whether the head is TP_STATUS_AVAILABLE, and in this case the head will remain TP_STATUS_SEND_REQUEST until a subsequent call to sendto succeeds in queueing the frame to the qdisc. A simple expedient (updated in packet_mmap.c) is to loop on the sendto until it succeeds or fails with an error other than ENOBUFS or EAGAIN.

Anyway, we now know more than enough to answer the OP's question, even if we don't have a complete solution for efficiently keeping the NIC from being starved.

From what we have learned, we know that when the OP calls sendto with a tx ring in blocking mode, tpacket_snd will start enqueuing skbs onto the qdisc until the sndbuf limit is exceeded, at which point it will block (while still holding pg_vec_lock). The default sndbuf is generally quite small, about 213K, and I further discovered that frame data referenced in the shared tx ring is counted towards it. As skbs free up, more frames will be enqueued, and maybe the sndbuf will be exceeded again and we will block again. Eventually, all the data will have been queued to the qdisc, but tpacket_snd will continue to block, still holding pg_vec_lock, until all of the frames have been transmitted. (You can't mark a frame in the tx ring as available until the NIC has received it, because an skb in the driver ring references a frame in the tx ring.) At this point the NIC is starved, and any other socket writers have been blocked by the lock.

On the other hand, when OP publishes a packet at a time, it will be handled by packet_snd which will block if there is no room in the sndbuf and then enqueue the frame onto the qdisc, and immediately return. It does not wait for the frame to be transmitted. As the qdisc is being drained, additional frames can be enqueued. If the publisher can keep up, the NIC will never be starved.

Further, the OP is copying into the tx ring for every sendto call and comparing that to passing a fixed frame buffer when not using a tx ring. Measured that way, the tx ring case still pays for a copy per frame while the non-ring case reuses one prepared buffer, so you won't see a speedup from avoiding the copy (although that is not the only benefit of using the tx ring).
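For a fairer comparison, a generator that sends the same frame repeatedly could write the payload into each ring slot once and afterwards only flip the slot status. The sketch below assumes TPACKET_V1 framing (struct tpacket_hdr, with the data starting at the aligned header size, which gives the 32-byte data offset seen in the log on a 64-bit build) and assumes the frame contents in a slot are still intact after the kernel hands the slot back. The helper names are hypothetical, and the caller is responsible for making sure len fits in the slot.

```c
#include <linux/if_packet.h>
#include <stddef.h>
#include <string.h>

/* Offset of the frame data from the start of a TPACKET_V1 tx ring slot. */
#define TX_DATA_OFFSET TPACKET_ALIGN(sizeof(struct tpacket_hdr))

/* Hypothetical helper: copy a frame into a slot and mark it for sending.
 * The status is written last so the kernel never sees a half-filled slot. */
static void tx_frame_fill(void *slot, const void *frame, unsigned int len)
{
    struct tpacket_hdr *hdr = slot;

    memcpy((char *)slot + TX_DATA_OFFSET, frame, len);
    hdr->tp_len = len;
    __atomic_store_n(&hdr->tp_status, TP_STATUS_SEND_REQUEST, __ATOMIC_RELEASE);
}

/* Hypothetical helper: re-send whatever is already in the slot, without
 * copying. Returns 1 if the slot was re-armed, 0 if the kernel has not
 * released it yet (status is not TP_STATUS_AVAILABLE). */
static int tx_frame_rearm(void *slot)
{
    struct tpacket_hdr *hdr = slot;

    if (__atomic_load_n(&hdr->tp_status, __ATOMIC_ACQUIRE) != TP_STATUS_AVAILABLE)
        return 0;
    __atomic_store_n(&hdr->tp_status, TP_STATUS_SEND_REQUEST, __ATOMIC_RELEASE);
    return 1;
}
```

With this split, tx_frame_fill runs once per slot at startup and the hot loop calls only tx_frame_rearm followed by sendto, so the per-frame copy disappears from the measurement.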

Source: https://stackoverflow.com/questions/43193889/sending-data-with-packet-mmap-and-packet-tx-ring-is-slower-than-normal-withou

Tags

performance

sockets

network-programming

circular-buffer