ZeroMQ: Packets being lost even after setting HWM and bufsize

Submitted by 房东的猫 on 2019-12-08 03:43:37

Question


We're using a PUSH/PULL Scalable Formal Communication Pattern in ZeroMQ. The sender application sends a total of 30,000 messages of 10 kB each. We observed a lot of data loss, so we set the following on the sender's side:

zmq_socket = context.socket(zmq.PUSH)
zmq_socket.setsockopt(zmq.SNDBUF, 10240)
zmq_socket.setsockopt(zmq.SNDHWM, 1)
zmq_socket.bind("tcp://127.0.0.1:4999")

On receiver's side:

ZMQ.Socket zmqSocket = zmqContext.socket(ZMQ.PULL);
zmqSocket.setReceiveBufferSize(10240);
zmqSocket.setRcvHWM(1);
zmqSocket.connect("tcp://127.0.0.1:4999");

There's still data loss, and we're not sure how to keep packets from being dropped silently.


EDIT 1:
Sender code in Python:

context = zmq.Context()
zmq_socket = context.socket(zmq.PUSH)
zmq_socket.setsockopt(zmq.SNDBUF, 10240)
zmq_socket.setsockopt(zmq.SNDHWM, 1)
zmq_socket.bind("tcp://127.0.0.1:4999")

for file_name in list_of_files:                       # read data from each file
    with open(os.path.join(self.local_base_dir, file_name), 'r') as sf:
        while True:
            socket_data = sf.read(5120)               # one 5 kB chunk at a time
            if socket_data == '':
                break                                 # until EoF
            try:
                zmq_socket.send(socket_data)          # pyzmq raises ZMQError on
            except zmq.ZMQError as e:                 # failure; it does not
                print(e)                              # return -1 like the C API

Receiver code in Java:

    private ZMQ.Socket zmqSocket = zmqContext.socket(ZMQ.PULL);       
    zmqSocket.setReceiveBufferSize(10240);
    zmqSocket.setRcvHWM(1);
    zmqSocket.connect(socketEndpoint);
    String message = new String(zmqSocket.recv());
    messages.add(message);

Answer 1:


Remarks on ZeroMQ, based just on what has been posted above so far ( not a complete MCVE ):

Never prematurely optimise resource allocations. Any quasi-optimisation decision introduced into code before the runtime operations meet all functional requirements is misleading: "optimising" an ill-functioning distributed system makes no sense at all. Only once the runtime has been tested and is guaranteed to work as specified may some resources get fine-tuned ex-post, and only if the rationale ( cost of doing so vs. positive performance benefits + reduced resource use ) justifies it. Never the other way around. Never.

Never design a ZeroMQ infrastructure as a consumable ( ref. the Java MCVE snippet above ). It takes the O/S a remarkable amount of time to set up and adjust the ZeroMQ world "under the hood" that the zmq.Context( n_IO_threads ) instantiation and all its ad-hoc derived .Socket() instances have to create: thread pools inside O/S-scheduler-dictated time-frames, plus the raw-socket distributed handshaking and multi-lateral negotiations with the remote counterparties. Even if SLOC one-liners or naive school-book examples may make it look that way, never set up the ZeroMQ infrastructure ALAP, "right" before reading just one message, and never dispose of the whole cathedral of the ZeroMQ framework ( hidden under the hood ) - be that explicitly or ( worse ) implicitly - right after .recv()-ing just one message. Never.
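For contrast, a minimal pyzmq sketch of the non-consumable approach: the Context and sockets are instantiated once and reused for every message. The in-process PULL peer and the random port are illustrative only, just to make the sketch self-contained:

```python
import zmq

# Build the ZeroMQ infrastructure ONCE, up front ...
ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
pull = ctx.socket(zmq.PULL)
port = push.bind_to_random_port("tcp://127.0.0.1")
pull.connect("tcp://127.0.0.1:%d" % port)

# ... then reuse the very same sockets for ALL messages,
# never re-instantiating Context + Socket around a single .recv()
for i in range(3):
    push.send(b"msg-%d" % i)
received = [pull.recv() for _ in range(3)]
print(received)                           # [b'msg-0', b'msg-1', b'msg-2']

for s in (push, pull):
    s.close()
ctx.term()
```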

Never ignore documented ZeroMQ features. If it states:

ØMQ does not guarantee that the socket will accept as many as ZMQ_SNDHWM messages, and the actual limit may be as much as 60-70% lower depending on the flow of messages on the socket.

one ought never ever attempt to declare .setsockopt( ZMQ_SNDHWM, 1 ). Never.
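One way to make the limit visible, instead of losing data silently, is a non-blocking send: pyzmq raises zmq.Again the moment a message cannot be queued, leaving the decision to the caller ( a sketch; the port number and message size are illustrative ):

```python
import zmq

ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
push.setsockopt(zmq.SNDHWM, 1)            # tiny HWM, as in the question
push.setsockopt(zmq.LINGER, 0)
push.bind("tcp://127.0.0.1:4999")         # no PULL peer is connected at all

accepted = 0
for _ in range(100):
    try:
        push.send(b"x" * 10240, zmq.NOBLOCK)
        accepted += 1
    except zmq.Again:                     # cannot queue now: the caller decides
        break                             # what to do, nothing is lost silently

print(accepted)                           # 0 here: a PUSH with no peers queues nothing
push.close()
ctx.term()
```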

Never leave the ZeroMQ infrastructure school-book "naked". There are many fair reasons to design principal try: / except: / finally: structures around any ZeroMQ infrastructure usage, so as to release all resources, down to the O/S-maintained port#-s, in every case, incl. unhandled exceptions. This is a serious resource-management must ( not only for re-entrant Factory patterns ) that no professional design shall ever skip. Never.

   GLOBAL_context = zmq.Context( setMoreIOthreadsForPERFORMANCE ) # future PERF
   PUSH_socket = GLOBAL_context.socket( zmq.PUSH )
# --------------------------------------------------------------- # .SET
   PUSH_socket.setsockopt(       zmq.SNDBUF,    123456 )          # ^ HI 1st
   PUSH_socket.setsockopt(       zmq.SNDHWM,    123456 )          # ^ HI 1st
#  PUSH_socket.setsockopt(       zmq.MAXMSGSIZE, 12345 )          # ~ LOCK !DDoS
   PUSH_socket.setsockopt(       zmq.AFFINITY,       0 )          # [0] 1st
# --------------------------------------------------------------- # GRACEFUL TERMINATION:
   PUSH_socket.setsockopt(       zmq.LINGER,         0 )          # ALWAYS
# --------------------------------------------------------------- #
   PUSH_socket.bind( "tcp://127.0.0.1:4999" )                     # or ipc:// w/o TCP overheads
# --------------------------------------------------------------- #
   ...
   app logic
   ...
# --------------------------------------------------------------- # ALWAYS:
   PUSH_socket.close()                                            # ALWAYS
   GLOBAL_context.term()                                          # ALWAYS
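The same setup, wrapped in the try: / finally: discipline argued for above, might look like this in pyzmq ( a minimal sketch; the in-process PULL peer and the random port are only there to make it self-contained ):

```python
import zmq

context = zmq.Context()
push = context.socket(zmq.PUSH)
pull = context.socket(zmq.PULL)
try:
    push.setsockopt(zmq.LINGER, 0)        # never let .term() hang on pending data
    pull.setsockopt(zmq.LINGER, 0)
    port = push.bind_to_random_port("tcp://127.0.0.1")
    pull.connect("tcp://127.0.0.1:%d" % port)
    push.send(b"payload")
    print(pull.recv())                    # b'payload'
finally:
    push.close()                          # release sockets and the O/S port
    pull.close()                          # in every case, incl. unhandled
    context.term()                        # exceptions
```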

Epilogue:

ZeroMQ clearly states that one either gets a complete message, or nothing at all. So if the application needs each and every message delivered, its design has to take all due care: the smart ZeroMQ tools handle many things, but leave this one at the designer's responsibility.

Given your comment explained there is just one receiver in the proposed system, one may rather go for the PAIR/PAIR Scalable Formal Communication Pattern and avoid all the overheads of the L3|L2|L1->-L1|L2|L3 multi-stack packet assembly / disassembly and O/S buffer management associated with the tcp:// transport-class. The ipc:// ( if the O/S permits ) or vmci:// transport-classes are less complex and thus a lot faster, with lower ( better ) latency and protocol overheads.

Using the PAIR/PAIR archetype also prevents nasty surprises once a DoS-attacker .connect()-s onto the PUSH_socket side: ZeroMQ has no means to refuse such a step, and the default behaviour of round-robin serving all the "potentially" connected peers ( even after some attacker(s) have been forced to disconnect ) efficiently devastates your design efforts.
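A minimal PAIR/PAIR sketch over the ipc:// transport-class ( POSIX platforms only; the endpoint path is illustrative ):

```python
import zmq

ctx = zmq.Context()
a = ctx.socket(zmq.PAIR)                  # PAIR accepts exactly one peer:
b = ctx.socket(zmq.PAIR)                  # no round-robin, no surprise .connect()-s
a.bind("ipc:///tmp/pair_demo.sock")       # ipc:// skips the whole TCP/IP stack
b.connect("ipc:///tmp/pair_demo.sock")

a.send(b"ping")
print(b.recv())                           # b'ping'

for s in (a, b):
    s.close()
ctx.term()
```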

Last but not least, such distributed processes ought rather always prefer the non-blocking mode of .send() / .recv() calls for professional high-performance signalling / messaging, plus .poll()-based control tools, inside the most critical real-time sections of the data-pumping services of the respective application domain.
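A non-blocking receive loop built on zmq.Poller, per the advice above ( the one-second timeout and the in-process sender are illustrative ):

```python
import zmq

ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
pull = ctx.socket(zmq.PULL)
port = push.bind_to_random_port("tcp://127.0.0.1")
pull.connect("tcp://127.0.0.1:%d" % port)

poller = zmq.Poller()
poller.register(pull, zmq.POLLIN)

push.send(b"tick")
events = dict(poller.poll(timeout=1000))  # wait at most 1 s, never hang forever
if pull in events:
    print(pull.recv(zmq.NOBLOCK))         # guaranteed not to block here
else:
    print("no message within the deadline")

for s in (push, pull):
    s.close()
ctx.term()
```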




Answer 2:


Use the ZMQ_IMMEDIATE flag to prevent any loss in messages that got queued to pipes which don't have a completed connection.

From http://api.zeromq.org/4-2:zmq-setsockopt#toc21

ZMQ_IMMEDIATE: Queue messages only to completed connections

By default queues will fill on outgoing connections even if the connection has not completed. This can lead to "lost" messages on sockets with round-robin routing (REQ, PUSH, DEALER). If this option is set to 1, messages shall be queued only to completed connections. This will cause the socket to block if there are no other connections, but will prevent queues from filling on pipes awaiting connection.

Option value type: int
Option value unit: boolean
Default value: 0 (false)
Applicable socket types: all, only for connection-oriented transports

zmq_socket.setsockopt(zmq.IMMEDIATE, 1)   # pyzmq exposes the option as zmq.IMMEDIATE
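The effect can be seen with a non-blocking send towards a peer whose connection never completes ( a sketch; the dead port is found at runtime, nothing is assumed to be listening there ):

```python
import socket
import zmq

# find a TCP port with no listener on it (freed immediately after probing)
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
dead_port = probe.getsockname()[1]
probe.close()

ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
push.setsockopt(zmq.IMMEDIATE, 1)         # queue only to completed connections
push.setsockopt(zmq.LINGER, 0)
push.connect("tcp://127.0.0.1:%d" % dead_port)

try:
    push.send(b"data", zmq.NOBLOCK)
    print("queued")                       # would happen with IMMEDIATE = 0
except zmq.Again:
    print("not queued")                   # nothing piles up on a dead pipe

push.close()
ctx.term()
```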


Source: https://stackoverflow.com/questions/43771830/zeromq-packets-being-lost-even-after-setting-hwm-and-bufsize
