Writing a stream protocol: Message size field or Message delimiter?

Question

I am about to write a message protocol going over a TCP stream. The receiver needs to know where the message boundaries are.

I can either send 1) fixed length messages, 2) size fields so the receiver knows how big the message is, or 3) a unique message terminator (I guess this can't be used anywhere else in the message).

I won't use #1 for efficiency reasons.

I like #2 but is it possible for the stream to get out of sync?

I don't like idea #3 because it means the receiver can't know the size of the message ahead of time, and it also requires that the terminator doesn't appear elsewhere in the message.

With #2, if it's possible to get out of sync, can I add a terminator or am I guaranteed to never get out of sync as long as the sender program is correct in what it sends? Is it necessary to do #2 AND #3?

Please let me know.

Thanks, jbu

Answer 1:

You are using TCP, so delivery is reliable: the connection either drops, times out, or you read the whole message. So option #2 is OK.

Answer 2:

I agree with sigjuice. If you have a size field, it's not necessary to add an end-of-message delimiter -- however, it's a good idea. Having both makes things much more robust and easier to debug.

Consider using the standard netstring format, which includes both a size field and an end-of-string character. Because it has a size field, it's OK for the end-of-string character to be used inside the message.
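To make the netstring idea concrete, here is a minimal sketch of an encoder and decoder. A netstring is the payload length in ASCII decimal, a colon, the payload bytes, and a trailing comma. The function names encode_netstring and decode_netstring are made up for this illustration, and the sketch does not enforce every rule of the format (for example, it does not reject leading zeros in the length).

```python
def encode_netstring(payload: bytes) -> bytes:
    """Wrap payload as <decimal length>:<payload>,"""
    return str(len(payload)).encode("ascii") + b":" + payload + b","


def decode_netstring(buf: bytes):
    """Decode one netstring from the start of buf.

    Returns (payload, rest), where rest is whatever follows the netstring.
    Raises ValueError if the data is malformed or incomplete.
    """
    head, sep, tail = buf.partition(b":")
    if not sep or not head.isdigit():
        raise ValueError("missing or malformed length prefix")
    length = int(head)
    if len(tail) < length + 1:
        raise ValueError("incomplete netstring")
    # The trailing comma is the sanity check: if the length was wrong,
    # this byte is very unlikely to be a comma.
    if tail[length:length + 1] != b",":
        raise ValueError("missing trailing comma")
    return tail[:length], tail[length + 1:]
```

Because the decoder skips over the payload by length, a comma or colon inside the payload is harmless; only the byte right after the declared length has to be a comma.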

Answer 3:

Depending on the level at which you're working, #2 may actually not have any issues with going out of sync (TCP has sequence numbering in the packets, and reassembles the stream in the correct order for you if it arrives out of order).

Thus, #2 is probably your best bet. In addition, knowing the message size early on in the transmission will make it easier to allocate memory on the receiving end.
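As a rough sketch of how option #2 might look on a blocking socket: the receiver reads a fixed-size header first, so it can allocate the buffer for the body once and fill it in place. The 4-byte big-endian header, the MAX_MESSAGE_SIZE cap, and the function names are assumptions chosen for this example, not part of any standard.

```python
import socket
import struct

HEADER = struct.Struct("!I")        # assumed: 4-byte unsigned length, network byte order
MAX_MESSAGE_SIZE = 16 * 1024 * 1024  # hypothetical sanity cap on a declared length


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes into a buffer allocated once up front."""
    buf = bytearray(n)
    view = memoryview(buf)
    got = 0
    while got < n:
        # recv_into may deliver fewer bytes than requested, so keep looping.
        k = sock.recv_into(view[got:])
        if k == 0:
            raise ConnectionError("connection closed mid-message")
        got += k
    return bytes(buf)


def send_msg(sock: socket.socket, payload: bytes) -> None:
    """Send the length header followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_msg(sock: socket.socket) -> bytes:
    """Read one length-prefixed message."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    if length > MAX_MESSAGE_SIZE:
        raise ValueError("declared length exceeds the sanity cap")
    return recv_exact(sock, length)
```

The cap is worth having in practice: without it, a buggy or hostile sender can make the receiver try to allocate up to 4 GiB from a single header.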

Answer 4:

Interesting that there is no clear answer here. #2 is safe over TCP no matter what, and is done "in the real world" quite often. This is because TCP guarantees that all data arrives both uncorrupted and in the order that it was sent, so there is no possibility that a correct implementation could get out of sync.
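One way to see why a correct implementation stays in sync: the receiver only has to treat the connection as one long byte stream and cut it at the declared lengths. How TCP happens to split the bytes across reads does not matter. The sketch below assumes a 4-byte big-endian length header; FrameDecoder and frame are hypothetical names used only for illustration.

```python
import struct


class FrameDecoder:
    """Accumulates raw bytes and returns complete length-prefixed messages."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data: bytes) -> list:
        """Add newly received bytes; return every message completed so far."""
        self._buf += data
        messages = []
        while len(self._buf) >= 4:
            (length,) = struct.unpack_from("!I", self._buf, 0)
            if len(self._buf) < 4 + length:
                break  # body not fully received yet; wait for more data
            messages.append(bytes(self._buf[4:4 + length]))
            del self._buf[:4 + length]
        return messages


def frame(payload: bytes) -> bytes:
    """Prefix payload with its length as a 4-byte big-endian integer."""
    return struct.pack("!I", len(payload)) + payload
```

Feeding the same stream one byte at a time, or all at once, produces the same list of messages, which is the property the answer relies on.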

Answer 5:

If you are developing both the transmit and receive code from scratch, it wouldn't hurt to use both length headers and delimiters. This would provide robustness and error detection. Consider the case where you just use #2. If you write a length field of N to the TCP stream, but end up sending a message whose size differs from N, the receiving end wouldn't know any better and would end up confused.

If you use both #2 and #3, while not foolproof, the receiver can have a greater degree of confidence that it received the message correctly if it encounters the delimiter after consuming N bytes from the TCP stream. You can also safely use the delimiter inside your message.

Take a look at HTTP Chunked Transfer Coding for a real world example of using both #2 and #3.
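For reference, here is a simplified sketch of the chunked format: each chunk is its size in hexadecimal, CRLF, the chunk data, and another CRLF, and a zero-size chunk ends the body. This sketch ignores chunk extensions and trailer fields, which the real coding allows, and the function names are made up for the example.

```python
def encode_chunked(chunks) -> bytes:
    """Encode an iterable of byte strings as a chunked body (no trailers)."""
    out = bytearray()
    for chunk in chunks:
        if chunk:  # skip empty chunks: a zero size would end the body early
            out += b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    out += b"0\r\n\r\n"
    return bytes(out)


def decode_chunked(data: bytes) -> bytes:
    """Decode a complete chunked body, checking the CRLF after each chunk."""
    body = bytearray()
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)   # raises ValueError if no size line
        size = int(data[pos:eol], 16)
        pos = eol + 2
        if size == 0:
            return bytes(body)
        chunk = data[pos:pos + size]
        # The size says where the chunk ends; the CRLF confirms it.
        if len(chunk) != size or data[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("chunk size does not match chunk data")
        body += chunk
        pos += size + 2
```

A chunk may itself contain CRLF, because the decoder jumps ahead by the declared size and only then expects the delimiter.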