Question
I'm writing a news application and I want to let my users choose their favourite news sources from a list of dozens (~60) of sources (Guardian, Times, ...). I have a News entity with an indexed property "source", and I'm looking for an approach that lets me bypass App Engine's limit of 30 subqueries, which prevents me from using the IN and EQUALS filters to fetch all the news belonging to a large list of sources.
Is there any workaround for this limitation?
Thanks
Answer 1:
Remember that indexes are expensive - they take a lot of space and multiply the write costs.
I would use a different design. Instead of 60 subqueries (and what happens if your list of sources grows to 500?), I would make the source property unindexed. Then I would load a list of all the latest news and keep it in Memcache. If you lose it, you can always reload it. You can also easily add items to this list as news comes in, and you can split the list into chunks based on time.
Now as users make their calls, you can easily filter this list in memory. Depending on your usage volume, this design will be dozens - or thousands - of times cheaper and work much faster. The biggest difference is that instead of reading the same entities over and over for each user request, you will read them once and serve thousands of requests before you need to read them again.
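For illustration, here is a minimal sketch of that design, assuming ndb; the cache key, the "published" property, the fetch limit and the 5-minute expiry are all assumptions, not part of the original answer, and a real app would also have to mind memcache's per-value size limit:

from google.appengine.api import memcache

NEWS_CACHE_KEY = 'latest-news'  # hypothetical cache key

def get_latest_news():
    # Serve the shared list from Memcache; rebuild it from the
    # datastore only on a cache miss.
    news = memcache.get(NEWS_CACHE_KEY)
    if news is None:
        # 'published' is an assumed DateTimeProperty on News; with
        # 'source' unindexed, we only order on the timestamp.
        news = News.query().order(-News.published).fetch(1000)
        memcache.set(NEWS_CACHE_KEY, news, time=300)  # 5-minute TTL
    return news

def news_for_user(chosen_sources):
    # Per-request filtering happens in memory, not in the datastore.
    wanted = set(chosen_sources)
    return [n for n in get_latest_news() if n.source in wanted]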
Answer 2:
With 60 sources, if the user wants N > 30 of them, you can split the interaction into two queries (one for 30, the second for the rest) and merge the results yourself. There are practical limits, of course, since you don't want to end up with an unbounded number of queries, but this should scale well above your current 60 sources w/o too many issues.
E.g., in Python, to generate a list of queries:
def many_source_queries(sources):
    # One query per chunk of at most 30 sources, staying under
    # the 30-subquery limit that IN imposes.
    queries = []
    for i in range(0, len(sources), 30):
        queries.append(News.query(News.source.IN(sources[i:i+30])))
    return queries
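For context, the snippets in this answer assume an ndb model along these lines; the exact properties are an assumption, and only the indexed "source" property actually matters here:

from google.appengine.ext import ndb

class News(ndb.Model):
    # Illustrative model; the snippets only rely on 'source'.
    source = ndb.StringProperty()
    published = ndb.DateTimeProperty()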
Then, to merge the multiple queries' results, there are of course many approaches, but a simplistic one, where you just fetch everything into a list, is rather trivial:
def fetch_many_queries(queries):
    # Flatten the results of all the queries into a single list.
    return [x for q in queries for x in q.fetch()]
Of course, you can add filters, ordering (and do a heapq.merge of the streams to keep the resulting order), etc. I'm just addressing the "30 subqueries" limitation here.
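For instance, a sketch of the heapq.merge approach, under the assumption that every query in the list was built with the same ascending .order(News.published) (heapq.merge only preserves order if each input stream is already sorted on the merge key):

import heapq

def fetch_merged_by_published(queries):
    # Decorate each entity with its sort key; the stream index i
    # breaks ties so entities themselves are never compared.
    streams = [((n.published, i, n) for n in q)
               for i, q in enumerate(queries)]
    return [n for _, _, n in heapq.merge(*streams)]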
Added: this only scales up to a point (esp. in terms of the number of sources desired in a query). For example, if entities can have (say) 600 different sources, and a query wants 300 of them, you're going to get about half the datastore returned to you (if the number of news items per source is roughly uniform), and it makes no sense to make 10 separate queries for that purpose.
IOW, as I wrote above, "there are practical limits since you don't want to end up with an unbounded number of queries". On detecting a query for more than some threshold N of sources (a large fraction of the store's contents), I'd rather make a single query w/o the source filter, and selectively ignore entities with "wrong" sources at the app level.
So in this case I'd take a different approach, e.g...:
import itertools as it

def supermany_source_queries(sources):
    # For a very large set of sources, don't filter in the datastore
    # at all: return an unfiltered query plus the set to filter by.
    return News.query(), set(sources)

def next_up_to_n(n, news_query, sources_set):
    # Yield pages of up to n entities whose source is in sources_set.
    def goodnews(news): return news.source in sources_set
    # Build the filtered iterator once, so each page picks up where
    # the previous one left off instead of restarting the query.
    stream = it.ifilter(goodnews, iter(news_query))
    while True:
        page = list(it.islice(stream, n))
        if not page: break
        yield page
Here, the main-line code would first call q, ss = supermany_source_queries(sources), then prepare the exact query eq from q with whatever .filter and/or .order may be required, and then loop, e.g.:

for onepage in next_up_to_n(pagesize, eq, ss):
    deal_with_page(onepage)
Of course, this could be factored in several different ways (probably best with a custom class, able to take a different tack depending on the number of sources being asked for), but once again I'm trying to highlight the general idea, i.e.: rather than using a huge number of separate queries for hundreds of sources, when you're getting a large fraction of the datastore back as a result anyway, use a single query (so resign yourself to getting up to all of the datastore instead of, say, half of it, depending of course on other filters and possible early termination in deal_with_page), and use iteration and selection at the application level (with itertools &c) to ignore entities not actually of interest.
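Putting the pieces of this answer together, a small dispatcher along these lines could pick a strategy based on the number of sources requested; the threshold of 90 and the page size of 100 are arbitrary assumptions, not values from the original:

MAX_IN_SOURCES = 90  # assumed cutoff: at most 3 IN queries of 30 each

def query_news(sources):
    # Few sources: a handful of chunked IN queries, merged in the app.
    if len(sources) <= MAX_IN_SOURCES:
        return fetch_many_queries(many_source_queries(sources))
    # Many sources: one unfiltered scan, filtered at the app level.
    q, ss = supermany_source_queries(sources)
    return [n for page in next_up_to_n(100, q, ss) for n in page]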
Source: https://stackoverflow.com/questions/28521770/workaround-of-gaes-30-subqueries-limitation