I'm trying to reduce the execution time of an App Engine query by running multiple sub-queries asynchronously, using query.fetch_async(). However, it seems that the gain is negligible.
Are you always running run_parallel before run_serial? If so, ndb caches the results, and the second run can pull the same data much faster. Try flipping the order, or better yet try with db, since ndb is essentially a wrapper around db that adds memcache and in-context caching of results.
The main problem is that your example is mostly CPU-bound rather than IO-bound. In particular, most of the time is likely spent decoding RPC results, which isn't done efficiently in Python because the GIL prevents that decoding from running in parallel. One of the problems with Appstats is that it measures RPC timing from when the RPC is sent until get_result() is called, so any time spent before get_result() is called appears to be RPC time.
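The GIL effect described above can be illustrated outside App Engine with a plain-Python sketch (ordinary threads standing in for parallel RPC decoding; cpu_bound is a hypothetical stand-in, not ndb code): CPU-bound work gains essentially nothing from being spread across threads.

```python
import threading
import time

def cpu_bound(n):
    # Stand-in for RPC result decoding: pure-Python CPU work
    # that holds the GIL the whole time.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 2_000_000

# Run four decode-sized chunks of work serially.
start = time.perf_counter()
for _ in range(4):
    cpu_bound(N)
serial = time.perf_counter() - start

# Run the same four chunks in parallel threads.
start = time.perf_counter()
threads = [threading.Thread(target=cpu_bound, args=(N,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
parallel = time.perf_counter() - start

# Under the GIL, the threaded version is no faster (often slightly slower
# due to lock contention), even though four threads are "running".
print(f"serial {serial:.2f}s, threaded {parallel:.2f}s")
```

This mirrors why parallel fetch_async() calls don't help when most of the wall-clock time is spent decoding results in Python rather than waiting on the Datastore.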
If you instead issue IO-bound RPCs (i.e. queries that make the Datastore itself do more of the work), you will start to see the performance gains of parallel queries.