Huge size (in bytes) difference between pickle protocol 2 and 3

Question

The streamer side keeps sending a sound sample of 2048 bytes along with the time as an integer, together in a tuple that gets pickled using pickle.dumps. The result is then sent in a UDP packet to the receiver, which unpickles it, buffers it, and plays the sound sample.

Everything was fine using Python 3; the bits-per-second rate on the receiver was what I expected.

When I ran the streamer in Python 2.7, the rate was higher! I thought Python 2 was somehow faster.

Then I used Wireshark to check the UDP packets the receiver was getting, and they were larger than they needed to be.

The streamer side:

while True:
    data = next(gen)                 # next 2048-byte sound sample
    print("data:{}".format(len(data)))
    stime += 1                       # integer timestamp for this sample
    msg = (stime, data)
    payload = pickle.dumps(msg)
    print("payload:{}".format(len(payload)))
    bytes_sent = s.sendto(payload, addr)
    time.sleep(INTERVAL)

The receiver side:

while True:
    if stop_receiving.get():
        break
    try:
        (payload, addr) = self.sock.recvfrom(32767)
        (t, data) = pickle.loads(payload, encoding="bytes")
        # Compare the timestamp that was just unpickled (t), not stime.
        if t >= self.frame_time.get():
            self.frames.put((t, data))
    except socket.timeout:
        pass

On Python 3.4 using pickle format 3, if I pickle.dumps a tuple of an integer and 2048 bytes, I get 2063 bytes.

Strangely, on Python 2.7 using pickle format 2, I get 5933 bytes, almost 3 times as many.

Why is this difference so big?

Should I just make my own protocol and append those bytes instead? I could have done that, but after I found pickle I thought it would work.

The Python docs also say one could use compression libraries to reduce the size, but I don't know whether the extra time overhead is worth it.

Thank you.

Answer 1:

First, as a general rule, it shouldn't be all that surprising that major new versions of protocols, libraries, etc. have major improvements. Otherwise, why would anyone have bothered to do all the work to create them?

But you're probably looking for specifics.

Before we get into anything else, your big problem is that you aren't comparing protocol 2 and protocol 3, you're comparing protocol 0 and protocol 3. Notice the last line in the pickletools.dis dumps below: highest protocol among opcodes = 2. If you see 0 instead of 2 there, that means you're using protocol 0. Protocol 0 was designed for human readability (well, at least human debuggability without a library like pickletools), not for compactness. In particular, it's going to backslash-escape non-printable-ASCII bytes, expanding most of them to 4 characters.
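To get a feel for the size of that expansion, here is a rough sketch. It assumes Python 3, whose bytes repr uses escaping rules close to those of a Python 2 str repr, so it approximates the effect of protocol 0 rather than running the 2.x pickler itself.

```python
# Approximate the protocol-0 escaping cost with Python 3's bytes repr.
# Assumption: its escaping rules are close to those of a Python 2 str repr.
data = bytes(range(256))            # one of every possible byte value
escaped = repr(data)[2:-1]          # drop the leading b' and the trailing '
ratio = len(escaped) / len(data)    # escaped characters per original byte
print(len(data), len(escaped), round(ratio, 2))
```

For evenly distributed byte values this comes to a little under 3 characters per byte, which is in line with the 5933 bytes reported for a 2048-byte sample.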

So, why are you getting 0 instead of 2? Because, for backward compatibility reasons, the highest protocol is not the default. The default is 0 in 2.x, and 3 in 3.x up through 3.7 (3.8 raised it to 4). See the docs for 2.7 and 3.4.

If you change your code to pickle.dumps(msg, protocol=pickle.HIGHEST_PROTOCOL) (or just protocol=-1), you'll get 2 and 4 instead of 0 and 3. The 2.x will still probably be bigger than the 3.x, for the reasons explained below, but nowhere near the same scale you're seeing now.
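As a quick check of how much the protocol argument matters, the sketch below pickles a payload shaped like the one in the question under several protocols. It assumes Python 3, so it cannot show the 2.x default of protocol 0, and the random payload is a hypothetical stand-in for real audio data.

```python
import os
import pickle

# Hypothetical payload: an integer timestamp plus 2048 bytes of sample data.
msg = (1, os.urandom(2048))

sizes = {
    "protocol 2": len(pickle.dumps(msg, protocol=2)),
    "protocol 3": len(pickle.dumps(msg, protocol=3)),
    "highest": len(pickle.dumps(msg, protocol=pickle.HIGHEST_PROTOCOL)),
    "-1": len(pickle.dumps(msg, protocol=-1)),  # negative means "highest"
}
for name, size in sizes.items():
    print(name, size)
```

Protocol 3 and later should land within a few dozen bytes of the raw 2048, while protocol 2 comes out noticeably larger.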

If you really want parity, and the protocol-2 results are compact enough for you, you might want to explicitly use protocol=2.

If you want to explicitly go with only 2 or 3, as you thought you were doing, there's no direct way to write that, but protocol=min(3, pickle.HIGHEST_PROTOCOL) will do it.
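A minimal sketch of that capped choice, assuming a small made-up message: for protocol 2 and later, the pickle starts with the PROTO opcode byte 0x80 followed by the protocol number, so the second byte shows what was actually used.

```python
import pickle

# Use protocol 3 where available, otherwise fall back to the highest one (2 on 2.x).
proto = min(3, pickle.HIGHEST_PROTOCOL)

msg = (1, b"\x00\xff" * 4)             # hypothetical small message
payload = pickle.dumps(msg, protocol=proto)

used = bytearray(payload)[1]           # byte after the 0x80 PROTO opcode
print(proto, used, pickle.loads(payload) == msg)
```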

The pickletools module (and comments in the source code, which is linked from the docs) make it easy to explore the difference.

Let's use a shorter string, to make it easier to look at:

>>> import pickle, pickletools, string
>>> t = (1, string.ascii_lowercase.encode('ascii'))
>>> p2 = pickle.dumps(t, protocol=2)
>>> p3 = pickle.dumps(t, protocol=3)
>>> len(p2), len(p3)
(78, 38)

So, the obvious difference is still there.

Now, let's look at what's in the pickles. (You'll probably want to use pickletools.dis(p2, annotate=1) in your own interpreter, but since most of the information scrolls off the edge of the screen, that's not as useful here…)

>>> pickletools.dis(p2)
    0: \x80 PROTO      2
    2: K    BININT1    1
    4: c    GLOBAL     '_codecs encode'
   20: q    BINPUT     0
   22: X    BINUNICODE 'abcdefghijklmnopqrstuvwxyz'
   53: q    BINPUT     1
   55: X    BINUNICODE 'latin1'
   66: q    BINPUT     2
   68: \x86 TUPLE2
   69: q    BINPUT     3
   71: R    REDUCE
   72: q    BINPUT     4
   74: \x86 TUPLE2
   75: q    BINPUT     5
   77: .    STOP
highest protocol among opcodes = 2

As you can see, protocol 2 stores bytes as a Unicode string plus a codec.

>>> pickletools.dis(p3)
    0: \x80 PROTO      3
    2: K    BININT1    1
    4: C    SHORT_BINBYTES b'abcdefghijklmnopqrstuvwxyz'
   32: q    BINPUT     0
   34: \x86 TUPLE2
   35: q    BINPUT     1
   37: .    STOP
highest protocol among opcodes = 3

… but protocol 3 stores them as a bytes object, using a new opcode that didn't exist in protocol 2.

In more detail:

The BINUNICODE family of opcodes takes a Unicode string and stores it as length-prefixed UTF-8.

The BINBYTES family of opcodes takes a byte string and stores it as length-prefixed bytes.
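The length prefixes can be seen directly in the raw pickle bytes. This is a small sketch assuming Python 3; it only checks that the expected opcode, length, and data sequence appears somewhere in the output.

```python
import pickle

p3 = pickle.dumps(b"abc", protocol=3)
# SHORT_BINBYTES: opcode byte b"C", a one-byte length, then the raw bytes.
has_binbytes = b"C\x03abc" in p3

pu = pickle.dumps("abc", protocol=2)
# BINUNICODE: opcode byte b"X", a 4-byte little-endian length, then UTF-8.
has_binunicode = b"X\x03\x00\x00\x00abc" in pu

print(has_binbytes, has_binunicode)
```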

Because protocols 1 and 2 don't have BINBYTES, bytes are stored as, in effect, a call to _codecs.encode with the result of b.decode('latin-1') and u'latin-1' as the arguments. (Why Latin-1? Probably because it's the simplest codec that maps every byte to a single Unicode character.)
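The round trip can be sketched by hand. The sample bytes are arbitrary, and the snippet imitates what the unpickler ends up calling rather than going through pickle itself.

```python
import codecs

raw = bytes([0, 65, 128, 255])                 # mix of ASCII and non-ASCII byte values
as_text = raw.decode("latin-1")                # each byte becomes exactly one code point
rebuilt = codecs.encode(as_text, "latin-1")    # the call the pickle effectively stores
print(len(as_text), rebuilt == raw)
```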

This adds 40 bytes of fixed overhead (which accounts for the difference between my p2 and p3).

More importantly, for your case, every non-ASCII byte (0x80 through 0xFF) will end up being two bytes of UTF-8. For uniformly random bytes, that's about 50% overhead on top of the fixed cost.
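The UTF-8 cost is easy to measure in isolation. The sketch below uses one of every byte value as a stand-in for uniformly random data and leaves the fixed pickle overhead out of the count.

```python
data = bytes(range(256))                           # every byte value exactly once
as_utf8 = data.decode("latin-1").encode("utf-8")   # what BINUNICODE ends up storing
overhead = len(as_utf8) / float(len(data)) - 1     # fractional growth
print(len(data), len(as_utf8), overhead)
```

The 128 ASCII values stay at one byte each and the other 128 take two, so the data grows by half.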

Note that there is a BINSTRING type in protocol 1 and later, which is pretty similar to BINBYTES, but it's defined as storing bytes in the default encoding, which is pretty much never useful. In 2.x, that wouldn't really make a difference, because you're not going to decode it anyway to get a str, but my guess would be that 2.6+ don't use it for 3.x compatibility.

There's also a STRING type that dates back to protocol 0, which stores an ASCII-encoded repr of the string. I don't think it's ever used in protocols 1 and higher. This would of course blow up any non-printable-ASCII bytes to a 2 or 4 byte backslash escape.

Source: https://stackoverflow.com/questions/26515272/huge-sizein-bytes-difference-between-pickle-protocol-2-and-3

Tags

python

sockets

streaming

pickle