ndb Models are not saved in memcache when using MapReduce

Question

I've created two MapReduce pipelines for uploading CSV files to create Categories and Products in bulk. Each Product is tied to a Category through a KeyProperty. The Category and Product models are built on ndb.Model, so based on the documentation, I would expect them to be cached in Memcache automatically when retrieved from the Datastore.

I've run these scripts on the server to upload 30 categories and, afterward, 3000 products. All the data appears in the Datastore as expected.

However, the Product upload doesn't seem to use Memcache to get the Categories. When I check the Memcache viewer in the portal, it reports a hit count of around 180 and a miss count of around 60. If I was uploading 3000 products and retrieving the category each time, shouldn't I have around 3000 hits + misses from fetching the category (i.e., Category.get_by_id(category_id))? And likely 3000 more misses from attempting to retrieve the existing product before creating a new one (the algorithm handles both entity creation and updates).

Here's the relevant product mapping function, which takes in a line from the CSV file in order to create or update the product:

import csv  # needed by csv.reader below


def product_bulk_import_map(data):
    """Product Bulk Import map function."""

    result = {"status": "CREATED"}
    product_data = data

    try:
        # parse input parameter tuple
        byteoffset, line_data = data

        # parse base product data
        product_data = [x for x in csv.reader([line_data])][0]
        (p_id, c_id, p_type, p_description) = product_data

        # process category
        category = Category.get_by_id(c_id)
        if category is None:
            raise Exception(product_import_error_messages["category"] % c_id)

        # store in datastore
        product = Product.get_by_id(p_id)
        if product is not None:
            result["status"] = "UPDATED"
            product.category = category.key
            product.product_type = p_type
            product.description = p_description
        else:
            product = Product(
                id=p_id,
                category=category.key,
                product_type=p_type,
                description=p_description
            )
        product.put()
        result["entity"] = product.to_dict()
    except Exception as e:
        # catch any exceptions, and note failure in output
        result["status"] = "FAILED"
        result["entity"] = str(e)

    # return results
    yield (str(product_data), result)
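
The CSV-parsing step in the function above can be isolated as a small helper. This is a minimal sketch, not the asker's code: the name parse_product_line and the four-column check are assumptions made for illustration; only the use of csv.reader on a single-line list comes from the original.

```python
import csv


def parse_product_line(line_data):
    """Parse one CSV line into (p_id, c_id, p_type, p_description).

    Hypothetical helper: the name and the column-count check are
    assumptions, not part of the original mapper.
    """
    # csv.reader accepts any iterable of lines, so wrap the single line in a list.
    row = next(csv.reader([line_data]))
    if len(row) != 4:
        raise ValueError("expected 4 columns, got %d" % len(row))
    return tuple(row)


# Quoted fields may contain commas; csv.reader keeps them in one column.
parsed = parse_product_line('p-1,c-7,widget,"Small, blue widget"')
```

Raising ValueError for a malformed row would still be caught by the mapper's existing except clause, so the row would be reported as FAILED rather than aborting the shard.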

Answer 1:

MapReduce intentionally disables memcache for NDB.

See _set_ndb_cache_policy() in mapreduce/util.py, line 373 (as of 2015-05-01):

def _set_ndb_cache_policy():
  """Tell NDB to never cache anything in memcache or in-process.

  This ensures that entities fetched from Datastore input_readers via NDB
  will not bloat up the request memory size and Datastore Puts will avoid
  doing calls to memcache. Without this you get soft memory limit exits,
  which hurts overall throughput.
  """
  ndb_ctx = ndb.get_context()
  ndb_ctx.set_cache_policy(lambda key: False)
  ndb_ctx.set_memcache_policy(lambda key: False)

You can force get_by_id() and put() to use memcache, e.g.:

product = Product.get_by_id(p_id, use_memcache=True)
...
product.put(use_memcache=True)

Alternatively, you can modify the NDB context if you are batching puts together with mapreduce.operation. However, I don't know enough to say whether this has other undesired effects:

ndb_ctx = ndb.get_context()
ndb_ctx.set_memcache_policy(lambda key: True)
...
yield operation.db.Put(product)
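
A policy function receives the entity's key and returns a boolean, so memcache could also be re-enabled for selected kinds only. The sketch below is a hedged illustration: category_only_memcache_policy and FakeKey are hypothetical names, and FakeKey merely imitates the kind() method of ndb.Key so the snippet runs outside App Engine. Whether a selective policy avoids the memory issue described in the docstring is not established here.

```python
def category_only_memcache_policy(key):
    """Allow memcache only for Category entities (hypothetical policy)."""
    return key.kind() == "Category"


class FakeKey(object):
    """Stand-in for ndb.Key, used only to exercise the policy locally."""

    def __init__(self, kind, id_):
        self._kind = kind
        self._id = id_

    def kind(self):
        return self._kind


category_allowed = category_only_memcache_policy(FakeKey("Category", "c-7"))
product_allowed = category_only_memcache_policy(FakeKey("Product", "p-1"))

# Inside a mapper on App Engine, the policy would be installed with:
#   ndb.get_context().set_memcache_policy(category_only_memcache_policy)
```

With a policy like this, the small set of Category entities would go through memcache while the much larger set of Product entities would not.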

As for the docstring's mention of "soft memory limit exits", I don't understand why that would occur if only memcache were enabled (i.e., with no in-context cache).

It actually seems like you want memcache to be enabled for puts; otherwise, your app ends up reading stale data from NDB's memcache after your mapper has modified the data underneath.

Answer 2:

As Slawek Rewaj already mentioned, this is caused by the in-context cache. When retrieving an entity, NDB tries the in-context cache first, then memcache, and finally retrieves the entity from the Datastore if it was found in neither. The in-context cache is just a Python dictionary whose lifetime and visibility are limited to the current request, but MapReduce makes multiple calls to product_bulk_import_map() within a single request, so repeated lookups of the same entity within that request are served from the in-context cache and never reach memcache.

You can find more information about the in-context cache here: https://cloud.google.com/appengine/docs/python/ndb/cache#incontext
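
The lookup order described in this answer can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not NDB's implementation: ToyLayeredStore and its counters are hypothetical, plain dictionaries stand in for all three layers, and the numbers (30 categories, 3000 lookups) are borrowed from the question.

```python
class ToyLayeredStore(object):
    """Toy three-layer read path: context cache, then memcache, then datastore."""

    def __init__(self, datastore):
        self.context_cache = {}  # per-request dictionary
        self.memcache = {}       # shared across requests
        self.datastore = datastore
        self.memcache_hits = 0
        self.memcache_misses = 0

    def get(self, key):
        # Layer 1: the in-context cache answers without touching memcache.
        if key in self.context_cache:
            return self.context_cache[key]
        # Layer 2: memcache; only lookups that reach here are counted.
        if key in self.memcache:
            self.memcache_hits += 1
            value = self.memcache[key]
        else:
            # Layer 3: fall back to the datastore and populate memcache.
            self.memcache_misses += 1
            value = self.datastore.get(key)
            if value is not None:
                self.memcache[key] = value
        if value is not None:
            self.context_cache[key] = value
        return value


categories = dict(("cat-%d" % i, {"name": "Category %d" % i}) for i in range(30))
store = ToyLayeredStore(categories)

# One "request" performing 3000 category lookups over 30 distinct keys.
for n in range(3000):
    store.get("cat-%d" % (n % 30))
first_request = (store.memcache_hits, store.memcache_misses)  # (0, 30)

# A new request starts with an empty in-context cache.
store.context_cache.clear()
for n in range(3000):
    store.get("cat-%d" % (n % 30))
second_request = (store.memcache_hits, store.memcache_misses)  # (30, 30)
```

In this model, memcache counters grow with the number of distinct keys per request rather than with the number of lookups, which is the effect this answer attributes to the in-context cache.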

Source: https://stackoverflow.com/questions/26223098/ndb-models-are-not-saved-in-memcache-when-using-mapreduce

Tags

google-app-engine

MapReduce

memcached

google-cloud-datastore

app-engine-ndb