How is it that json serialization is so much faster than yaml serialization in Python?

不思量自难忘° 2020-12-12 15:12

I have code that relies heavily on yaml for cross-language serialization, and while working on speeding some stuff up I noticed that yaml was insanely slow compared to other serialization methods.

5 Answers
  •  北海茫月
    2020-12-12 15:32

    Although you have an accepted answer, unfortunately it only does some handwaving in the direction of the PyYAML documentation and quotes a statement from that documentation that is not correct: PyYAML does not make a representation graph during dumping; it creates a linear stream (and, just like json, keeps a bucket of IDs to detect recursion).


    First of all you have to realize that, while the cjson dumper is handcrafted C code only, YAML's CSafeDumper shares two of the four dump stages (Representer and Resolver) with the normal pure-Python SafeDumper, and that the other two stages (the Serializer and Emitter) are not completely handcrafted C either, but consist of a Cython module that calls the C library libyaml for emitting.


    Apart from that significant part, the simple answer to your question of why it takes longer is that dumping YAML does more. This is not so much because YAML is harder, as @flow claims, but because the extra things YAML can do make it much more powerful than JSON and also more user friendly if you need to process the result with an editor. That means more time is spent in the YAML library, both when applying these extra features and, in many cases, just checking whether something applies.

    Here is an example: even if you have never gone through the PyYAML code, you'll have noticed that the dumper doesn't quote foo and bar. That is not because these strings are keys; YAML doesn't have the restriction that JSON has, that a key for a mapping needs to be a string. E.g. a Python string that is a value in a mapping can also be unquoted (i.e. plain).

    The emphasis is on can, because it is not always so. Take for instance a string that consists of numeral characters only: 12345678. This needs to be written out with quotes, as otherwise it would look exactly like a number (and be read back in as such when parsing).
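    You can see that decision being made with a tiny sketch (assuming PyYAML and the standard json module; this example is mine, not part of the original timings):

    import json
    import yaml

    d = {'foo': 'bar', 'num': '12345678'}
    # PyYAML leaves 'bar' plain, but quotes '12345678' so it reloads as a string
    print(yaml.safe_dump(d, default_flow_style=False), end='')
    # foo: bar
    # num: '12345678'
    print(json.dumps(d))   # JSON always quotes strings, so no decision is needed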

    How does PyYAML know when to quote a string and when not? On dumping it actually first dumps the string, then parses the result to make sure that, when it reads that result back, it gets the original value. And if that proves not to be the case, it applies quotes.

    Let me repeat the important part of the previous sentence again, so you don't have to re-read it:

    it dumps the string, then parses the result

    This means it applies all of the regex matching it does when loading, to see whether the resulting scalar would load as an integer, float, boolean, datetime, etc., and from that determines whether quotes need to be applied or not.¹
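    If you want to poke at that machinery yourself, the implicit-resolution side of the check can be approximated like this (a sketch of mine using PyYAML's Resolver; it is not the exact code path the dumper takes):

    import yaml

    resolver = yaml.resolver.Resolver()

    def implicit_tag(value):
        # which tag would this plain (unquoted) scalar resolve to on loading?
        return resolver.resolve(yaml.ScalarNode, value, (True, False))

    print(implicit_tag('foo'))                  # tag:yaml.org,2002:str -> can stay plain
    print(implicit_tag('12345678'))             # tag:yaml.org,2002:int -> needs quotes
    print(implicit_tag('2018-09-03 12:34:56'))  # tag:yaml.org,2002:timestamp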


    In any real application with complex data, a JSON-based dumper/loader is too simple to use directly, and a lot more intelligence has to be in your program than when dumping the same complex data directly to YAML. A simplified example is when you want to work with date-time stamps: in that case you have to convert a string back and forth to datetime.datetime yourself if you are using JSON. During loading you have to do that either based on the fact that this is a value associated with some (hopefully recognisable) key:

    { "datetime": "2018-09-03 12:34:56" }
    

    or with a position in a list:

    ["FirstName", "Lastname", "1991-09-12 08:45:00"]
    

    or based on the format of the string (e.g. using regex).

    In all of these cases much more work needs to be done in your program. The same holds for dumping and that does not only mean extra development time.
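    As a rough illustration of that extra work on the loading side, here is a sketch using the stdlib json module rather than cjson (the regex and the revive() helper are mine, not from your code):

    import datetime
    import json
    import re

    import yaml

    TIMESTAMP = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$')

    def revive(mapping):
        # object_hook: turn every value that *looks* like a timestamp back into datetime
        for key, value in mapping.items():
            if isinstance(value, str) and TIMESTAMP.match(value):
                mapping[key] = datetime.datetime.strptime(value, '%Y-%m-%d %H:%M:%S')
        return mapping

    print(json.loads('{"datetime": "2018-09-03 12:34:56"}', object_hook=revive))
    # {'datetime': datetime.datetime(2018, 9, 3, 12, 34, 56)}

    # PyYAML's SafeLoader does the same conversion automatically via its timestamp resolver
    print(yaml.safe_load('datetime: 2018-09-03 12:34:56'))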

    Let's regenerate your timings with what I get on my machine, so we can compare them with other measurements. I rewrote your code somewhat, because it was incomplete (timeit?) and imported other things twice. It was also impossible to just cut and paste because of the >>> prompts.

    from __future__ import print_function
    
    import sys
    import yaml
    import cjson
    from timeit import timeit
    
    NR = 10000                      # number of repetitions for timeit
    ds = "; d={'foo': {'bar': 1}}"  # appended to each timeit setup string
    d = {'foo': {'bar': 1}}
    
    print('yaml.SafeDumper:', end=' ')
    yaml.dump(d, sys.stdout, Dumper=yaml.SafeDumper)
    print('cjson.encode:   ', cjson.encode(d))
    print()
    
    
    res = timeit("yaml.dump(d, Dumper=yaml.SafeDumper)", setup="import yaml"+ds, number=NR)
    print('yaml.SafeDumper ', res)
    res = timeit("yaml.dump(d, Dumper=yaml.CSafeDumper)", setup="import yaml"+ds, number=NR)
    print('yaml.CSafeDumper', res)
    res = timeit("cjson.encode(d)", setup="import cjson"+ds, number=NR)
    print('cjson.encode    ', res)
    

    and this outputs:

    yaml.SafeDumper: foo: {bar: 1}
    cjson.encode:    {"foo": {"bar": 1}}
    
    yaml.SafeDumper  3.06794905663
    yaml.CSafeDumper 0.781533956528
    cjson.encode     0.0133550167084
    

    Now let's dump a simple data structure that includes a datetime:

    import datetime
    from collections import Mapping, Sequence  # python 2.7 has no .abc

    d = {'foo': {'bar': datetime.datetime(1991, 9, 12, 8, 45, 0)}}

    def stringify(x, key=None):
        # key signals that x is used as a mapping key; JSON keys have to be strings
        if isinstance(x, str):
            return x
        if isinstance(x, Mapping):
            res = {}
            for k, v in x.items():
                res[stringify(k, key=True)] = stringify(v)
            return res
        if isinstance(x, Sequence):
            res = [stringify(k) for k in x]
            if key:
                res = repr(res)
            return res
        if isinstance(x, datetime.datetime):
            return x.isoformat(sep=' ')
        return repr(x)
    
    print('yaml.CSafeDumper:', end=' ')
    yaml.dump(d, sys.stdout, Dumper=yaml.CSafeDumper)
    print('cjson.encode:    ', cjson.encode(stringify(d)))
    print()
    

    This gives:

    yaml.CSafeDumper: foo: {bar: '1991-09-12 08:45:00'}
    cjson.encode:     {"foo": {"bar": "1991-09-12 08:45:00"}}
    

    For the timing of the above I created a module myjson that wraps cjson.encode and has the above stringify defined. If you use that:

    d = {'foo': {'bar': datetime.datetime(1991, 9, 12, 8, 45, 0)}}
    ds = 'import datetime, myjson, yaml; d=' + repr(d)
    res = timeit("yaml.dump(d, Dumper=yaml.CSafeDumper)", setup=ds, number=NR)
    print('yaml.CSafeDumper', res)
    res = timeit("myjson.encode(d)", setup=ds, number=NR)
    print('cjson.encode    ', res)
    

    giving:

    yaml.CSafeDumper 0.813436031342
    cjson.encode     0.151570081711
    

    Even for that still rather simple output, you are already back from a two-orders-of-magnitude difference in speed to less than one order of magnitude.
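    For reference, the myjson module used in those timings could look roughly like this; this is only a sketch (the real module is not shown here), and it assumes the stringify() function defined above is pasted into the same file:

    # myjson.py -- sketch of the wrapper module used in the timings above
    import cjson
    # ... plus the stringify() function exactly as defined earlier in this answer ...

    def encode(obj):
        # convert datetime instances (and other non-JSON types) to strings first
        return cjson.encode(stringify(obj))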


    YAML's plain scalars and block style formatting make for more readable data. That you can have a trailing comma in a sequence (or mapping) makes for fewer failures when manually editing YAML data than with the same data in JSON.

    YAML tags allow for in-data indication of your (complex) types. When using JSON you have to take care, in your code, of anything more complex than mappings, sequences, integers, floats, booleans and strings. Such code requires development time and is unlikely to be as fast as python-cjson (you are of course free to write your code in C as well).
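    A small sketch of what such a tag looks like (the Point class is just an example of mine; loading it back needs an unsafe loader, e.g. yaml.unsafe_load in recent PyYAML):

    import yaml

    class Point(object):
        def __init__(self, x, y):
            self.x = x
            self.y = y

    # the full (non-safe) Dumper tags arbitrary Python objects, so the type survives
    print(yaml.dump(Point(1, 2)))
    # emits something like: !!python/object:__main__.Point {x: 1, y: 2}

    # json.dumps(Point(1, 2)) would just raise TypeError: not JSON serializable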

    Dumping some data, like recursive data structures (e.g. topological data) or complex keys, is pre-defined in the PyYAML library. There the JSON library just errors out, and implementing a workaround for that is non-trivial and will most likely slow things down so much that the speed differences matter less.
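    For example (a sketch with the stdlib json module instead of cjson; the behaviour for circular references is comparable):

    import json
    import yaml

    lst = [1, 2]
    lst.append(lst)            # a self-referencing (recursive) list

    # PyYAML handles this out of the box with an anchor and an alias
    print(yaml.safe_dump(lst, default_flow_style=True), end='')   # &id001 [1, 2, *id001]

    try:
        json.dumps(lst)
    except ValueError as exc:  # the JSON encoder refuses circular references
        print('json:', exc)    # json: Circular reference detected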

    Such power and flexibility come at the price of lower speed. When dumping many simple things JSON is the better choice; you are unlikely to edit the result by hand anyway. For anything that involves editing, or complex objects, or both, you should still consider using YAML.


    ¹ It is possible to force dumping of all Python strings as YAML scalars with (double) quotes, but setting the style is not enough to prevent all of the reading back.
