Mongoengine is very slow on large documents compared to native pymongo usage

Asked by 清歌不尽 on 2020-12-31 08:51

I have the following mongoengine model:

class MyModel(Document):
    date        = DateTimeField(required = True)
    data_dict_1 = DictField(required = False)


        
1 Answer
  • Answered 2020-12-31 09:37

    TL;DR: mongoengine is spending ages converting all the returned arrays to dicts

    To test this out I built a collection containing a document with a DictField holding a large nested dict, roughly in your 5-10MB range.

    We can then use timeit.timeit to confirm the difference in reads using pymongo and mongoengine.

    We can then use pycallgraph and GraphViz to see what is taking mongoengine so damn long.

    Here is the code in full:

    import datetime
    import itertools
    import random
    import sys
    import timeit
    from collections import defaultdict
    
    import mongoengine as db
    from pycallgraph.output.graphviz import GraphvizOutput
    from pycallgraph.pycallgraph import PyCallGraph
    
    db.connect("test-dicts")
    
    
    class MyModel(db.Document):
        date = db.DateTimeField(required=True, default=datetime.date.today)
        data_dict_1 = db.DictField(required=False)
    
    
    MyModel.drop_collection()
    
    data_1 = ['foo', 'bar']
    data_2 = ['spam', 'eggs', 'ham']
    data_3 = ["subf{}".format(f) for f in range(5)]
    
    m = MyModel()
    tree = lambda: defaultdict(tree)  # http://stackoverflow.com/a/19189366/3271558
    data = tree()
    for _d1, _d2, _d3 in itertools.product(data_1, data_2, data_3):
        data[_d1][_d2][_d3] = list(random.sample(range(50000), 20000))
    m.data_dict_1 = data
    m.save()
    
    
    def pymongo_doc():
        return db.connection.get_connection()["test-dicts"]['my_model'].find_one()
    
    
    def mongoengine_doc():
        return MyModel.objects.first()
    
    
    if __name__ == '__main__':
        print("pymongo took {:2.2f}s".format(timeit.timeit(pymongo_doc, number=10)))
        print("mongoengine took", timeit.timeit(mongoengine_doc, number=10))
        with PyCallGraph(output=GraphvizOutput()):
            mongoengine_doc()
    

    And the output proves that mongoengine is being very slow compared to pymongo:

    pymongo took 0.87s
    mongoengine took 25.81118331072267
    

    The resulting call graph illustrates pretty clearly where the bottleneck is:

    Essentially mongoengine will call the to_python method on every DictField that it gets back from the db. to_python is pretty slow and in our example it's being called an insane number of times.
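    To get a feel for why that per-element conversion dominates, here is a standalone illustration (plain Python, not mongoengine's actual implementation): a to_python-style pass that recursively rebuilds a nested structure is dramatically slower than simply handing back the already-decoded dict, which is essentially what pymongo does.

    ```python
    import timeit

    # Illustration only -- not mongoengine's actual code. A to_python-style
    # pass visits every key and every list element and rebuilds the structure.
    def convert(value):
        if isinstance(value, dict):
            return {k: convert(v) for k, v in value.items()}
        if isinstance(value, list):
            return [convert(v) for v in value]
        return value

    # A nested dict holding 100 lists of 1000 ints, loosely like the test doc.
    data = {"a": {"b": [list(range(1000)) for _ in range(100)]}}

    raw = timeit.timeit(lambda: data, number=100)            # pymongo-style: return as-is
    walked = timeit.timeit(lambda: convert(data), number=100)  # conversion pass
    print(raw, walked)
    ```

    The exact ratio depends on the machine, but the conversion pass is orders of magnitude slower than returning the decoded data untouched.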

    Mongoengine is used to elegantly map your document structure to python objects. If you have very large unstructured documents (which mongodb is great for) then mongoengine isn't really the right tool and you should just use pymongo.

    However, if you know the structure you can use EmbeddedDocument fields to get somewhat better performance from mongoengine. I've run similar, though not equivalent, test code in this gist and the output is:

    pymongo with dict took 0.12s
    pymongo with embed took 0.12s
    mongoengine with dict took 4.3059175412661075
    mongoengine with embed took 1.1639373211854682
    

    So you can make mongoengine faster but pymongo is much faster still.
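    For reference, the embed variant might be declared roughly like this (a sketch with assumed class and field names, not the gist's exact code): declaring the known levels as EmbeddedDocument classes with typed fields lets mongoengine skip the generic per-element DictField conversion for those levels.

    ```python
    import mongoengine as db

    # Sketch only: names are assumptions, not the gist's exact code.
    class Leaf(db.EmbeddedDocument):
        values = db.ListField(db.IntField())

    class MyModelEmbed(db.Document):
        date = db.DateTimeField(required=True)
        # MapField maps arbitrary string keys to a typed embedded document.
        data_embed_1 = db.MapField(db.EmbeddedDocumentField(Leaf))
    ```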

    UPDATE

    A good shortcut to the pymongo interface here is the aggregation framework, since aggregation results come back as raw dicts and bypass the to_python conversion entirely:

    def mongoengine_agg_doc():
        return list(MyModel.objects.aggregate({"$limit":1}))[0]
    