Improve throughput of ndb query over large data


In case anyone is interested, I was able to significantly increase the throughput of the data processing by redesigning the component. It was suggested that I change the data models, but that was not possible.

First, I segmented the data and then processed each segment in a separate taskqueue.Task, instead of issuing multiple fetch_page_async calls from a single task (as I described in the first post). Initially these tasks were processed by GAE sequentially, utilizing only a single Fx (frontend) instance. To parallelize the tasks, I moved the component to a dedicated GAE module and used basic scaling, i.e. addressable Bx instances. When enqueueing the task for each data segment, I explicitly specify which basic instance will handle it via the 'target' option, as in the sketch below.
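A minimal sketch of that fan-out, assuming a basic-scaling module named 'processing' and a hypothetical task handler URL (neither is from the original code):

```python
from google.appengine.api import taskqueue

NUM_INSTANCES = 5  # number of addressable basic-scaling instances

def enqueue_segments(segments):
    for i, segment_id in enumerate(segments):
        taskqueue.add(
            url='/tasks/process_segment',      # hypothetical handler
            params={'segment': segment_id},
            # Route the task to a specific instance of the module,
            # using the 'instance.module' form of the target option
            # (a version id can also be included).
            target='%d.processing' % (i % NUM_INSTANCES))
```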

With this design, I was able to process 20,000 entities in total within 4-5 seconds (instead of 40-60 seconds!), using 5 B4 instances.

Now, this has additional costs because of the Bx instances; we'll have to fine-tune the type and number of basic instances we need.

The new experimental Data Processing feature (an AppEngine API for MapReduce) might be suitable. It uses automatic sharding to execute multiple parallel worker processes, which may or may not help (like Approach C in the other linked question).
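As a rough sketch (assuming the open-source appengine-mapreduce library that this feature builds on; the handler name, entity kind, and shard count below are illustrative, not from the original setup):

```python
from mapreduce import control
from mapreduce import operation as op

def process(entity):
    # Per-entity work goes here; the framework shards the datastore
    # scan and runs the mappers in parallel.
    entity.processed = True   # hypothetical field
    yield op.db.Put(entity)   # puts are batched by the framework

def start_job():
    control.start_map(
        name='process-entities',
        handler_spec='main.process',
        reader_spec='mapreduce.input_readers.DatastoreInputReader',
        mapper_parameters={'entity_kind': 'main.MyEntity'},
        shard_count=8)  # automatic sharding across parallel workers
```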

Your comment about "no need to scan all entities" triggers the thought that custom indexes could help your queries. That may entail schema changes to store the data in a less normal form.
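For instance, a minimal sketch of such "less normal" storage, with assumed names (the OrderSummary kind is purely illustrative): the aggregate is precomputed at write time, so queries only read indexed fields.

```python
from google.appengine.ext import ndb

class OrderSummary(ndb.Model):
    """Hypothetical denormalized entity: one row per customer per day."""
    customer_id = ndb.StringProperty()   # indexed by default
    day = ndb.DateProperty()             # indexed by default
    total = ndb.IntegerProperty()        # precomputed at write time
```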

Design the solution from the output perspective: first, what is the simplest query that produces the required results; then, what entity structure supports such a query; and finally, what work is needed to create and maintain that entity structure from the current data.
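Continuing the hypothetical OrderSummary sketch above, the output-first steps might look like this: a simple indexed query, plus maintenance performed wherever the source data is written.

```python
def totals_for_customer(customer_id, since):
    # Step 1: the simplest query that produces the required results.
    # An equality filter combined with an inequality filter like this
    # needs a composite index declared in index.yaml.
    qry = OrderSummary.query(
        OrderSummary.customer_id == customer_id,
        OrderSummary.day >= since)
    return qry.fetch()  # served from the index; no scan of raw entities

# Step 3: maintain the structure by updating the summary wherever the
# source data is written, ideally in the same (cross-group) transaction.
```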
