ElasticSearch: Use a compound tenant-ID + page-ID field?

Question

I've just started devising an ElasticSearch mapping for a multitenant web app. In this app, there are site IDs and page IDs. Page IDs are unique per site, and randomly generated. Pages can have child pages.

What is best:

1) Use a compound key with site + page IDs? Like so:

"sitePageIdPath": "(siteID):(grandparent-page-ID).(parent-page-ID).(page-ID)"

or:

2) Use separate fields for site ID and page IDs? Like so:

"siteId": "(siteID)",
"pageIdPath": "(grandparent-page-ID).(parent-page-ID).(page-ID)"

I'm thinking that if I merge site ID and page IDs into one single field, then ElasticSearch will need to handle only that field, and this should be somewhat more performant than using two fields — both when indexing and when searching? And require less storage space.

However, perhaps there's some drawback that I'm not aware of? Hence this question.

Some details: 1) I'm using a single index, and I'm overallocating shards (100 shards), as suggested when one uses the "users" data flow pattern. 2) I'm specifying routing parameters explicitly in the URL (i.e. &routing=site-ID), not via any siteId field in the documents that are indexed.

Update 7 hours later:

1) All queries should be filtered by site id (that is, tenant id). If I do combine the site ID with the page ID, I suppose/hope that I can use a prefix filter, to filter on site ID. I wonder if this will be as fast as filtering on a single dedicated siteId field (e.g. can the results be cached).

2) Example queries: Full text search. List all users. List all pages. List all child/successor pages of a certain page. Load a single page (via _source).

Update 22 hours later:

3) I am able to search by page ID, because as ElasticSearch's _id, I store: (site-ID):(page-ID). So it's not a problem that the page ID is otherwise "hidden" as the last element of pageIdPath. (I probably should have mentioned earlier that I had a separate page ID field, but I wanted to keep the question short.)

4) I use index: not_analyzed for these ID fields.
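To make the setup above concrete, here is a minimal Python sketch of how the compound _id, the pageIdPath value, and the routing parameter could be assembled. The helper names, the base URL, and the index and type names ("sites", "page") are hypothetical and only serve the illustration; the sketch builds strings and does not talk to a cluster.

```python
from urllib.parse import quote, urlencode


def doc_id(site_id, page_id):
    """Compound _id as described in the question: (site-ID):(page-ID)."""
    return f"{site_id}:{page_id}"


def page_id_path(ancestor_ids, page_id):
    """Dot-separated path from the oldest ancestor down to the page itself."""
    return ".".join([*ancestor_ids, page_id])


def index_url(base_url, index, doc_type, site_id, page_id):
    """URL for indexing one page, routed by site ID (hypothetical index/type names)."""
    parts = (index, doc_type, doc_id(site_id, page_id))
    path = "/".join(quote(part, safe="") for part in parts)
    return f"{base_url}/{path}?{urlencode({'routing': site_id})}"


if __name__ == "__main__":
    print(page_id_path(["grandparent", "parent"], "page"))
    print(index_url("http://localhost:9200", "sites", "page", "123", "abc"))
```

Because the routing value is the site ID, all documents of one site land on the same shard regardless of whether the site ID is also stored in a separate field.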

Answer 1:

There are performance issues when indexing and searching if you use 1 field. I think you're mistaken in thinking 1 field would speed things up.

If using 1 field you have basically 2 mapping choices:

1) If you use the default mappings, the string (siteID):(grandparent-page-ID).(parent-page-ID).(page-ID) will get broken up by the analyzer into the tokens (siteID) (grandparent-page-ID) (parent-page-ID) (page-ID). Now your IDs are like a bag of words, and either a term or prefix filter might find a match in the pageID when you meant for it to match the siteID.

2) If you set your own analyzer (and I would like to know if you can think of a good way of doing this), the first one that comes to mind is the keyword (or not_analyzed) analyzer. This will keep the string as one token so you don't lose the context. However, now you have a big performance hit when using a prefix filter. Imagine I index the string "123.456.789" as one token (siteID.parentpageID.pageID). I want to filter by siteID = 123 and so I use a prefix filter. This prefix filter is actually expanded into a bool query of hundreds of terms all ORed together (123 or 1231 or 1232 or 1233 etc...), which is a massive waste of computing power when you could just structure your data better.
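The bag-of-words problem from the first choice can be sketched in a few lines of Python. This is only an illustration: naive_tokenize is a hypothetical stand-in that splits on every non-alphanumeric character, which is how the answer describes the outcome. The real standard analyzer's splitting rules differ in detail, so treat the token list as an assumption.

```python
import re


def naive_tokenize(value):
    """Crude stand-in for an analyzer: split on anything non-alphanumeric."""
    return [token for token in re.split(r"[^0-9A-Za-z]+", value) if token]


def term_filter_matches(tokens, term):
    """A term filter on an analyzed field matches if any token equals the term."""
    return term in tokens


# A page that belongs to site 999 but happens to have page ID 123.
compound = "999:456.789.123"
tokens = naive_tokenize(compound)

# Filtering for site 123 on the tokenized compound field: false positive.
false_positive = term_filter_matches(tokens, "123")

# With a dedicated site_id field the comparison is exact.
document = {"site_id": "999", "page_id": "123"}
exact_match = document["site_id"] == "123"

if __name__ == "__main__":
    print(tokens, false_positive, exact_match)
```

The positional meaning of each ID is lost once the string is tokenized, which is why a page ID can be mistaken for a site ID.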

I urge you to read more about Lucene's PrefixQuery and how it works.

If I were you I would do this.

Mapping

"properties": {
  "site_id": {
    "type": "string",
    "index": "not_analyzed" //keyword would also work here, they are basically the same
  },
  "parent_page_id": {
    "type": "string",
    "index": "not_analyzed"
  },
  "page_id": {
    "type": "string",
    "index": "not_analyzed"
  },
  "page_content": {
    "type": "string",
    "analyzer": "standard" //you may want to use snowball to enable stemming
  }
}

Queries

Text search for "elasticsearch tutorial" under siteID "123"

"filtered": {
  "query": {
    "match": {
      "page_content": "elasticsearch tutorial"
    }
  },
  "filter": {
    "term": {
      "site_id": "123"
    }
  }
}

All child pages of page "456" under site "123"

"filtered": {
  "query": {
    "match_all": {}
  },
  "filter": {
    "and": [
      {
        "term": {
          "site_id": "123"
        }
      },
      {
        "term": {
          "parent_page_id": "456"
        }
      }
    ]
  }
}
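Since every query has to be restricted to one site, the two query bodies above can be generated by one small helper so the site filter is never forgotten. The following Python sketch builds the same "filtered" structures as plain dicts; the function name site_filtered_query and its parameters are hypothetical, and the field names come from the mapping in this answer.

```python
def site_filtered_query(site_id, query=None, extra_terms=None):
    """Wrap a query so that it only sees documents of the given site.

    query:       a query DSL dict; defaults to match_all.
    extra_terms: optional {field: value} pairs added as further term filters.
    """
    filters = [{"term": {"site_id": site_id}}]
    for field, value in (extra_terms or {}).items():
        filters.append({"term": {field: value}})

    # A single filter is used as-is; several are combined with "and".
    combined = filters[0] if len(filters) == 1 else {"and": filters}

    return {
        "filtered": {
            "query": query if query is not None else {"match_all": {}},
            "filter": combined,
        }
    }


# Text search for "elasticsearch tutorial" under site "123".
text_search = site_filtered_query(
    "123", query={"match": {"page_content": "elasticsearch tutorial"}}
)

# All child pages of page "456" under site "123".
child_pages = site_filtered_query("123", extra_terms={"parent_page_id": "456"})
```

Both results serialize with json.dumps to the request bodies shown above.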

Answer 2:

Edit: There's a problem with this answer, namely possible BooleanQuery.TooManyClauses exceptions; please see the update below, after the original answer. /Edit

I think it's okay to combine the site ID and the page ID, and use a prefix filter that matches on the site ID when querying. I found this info in the Query DSL docs:

Some filters already produce a result that is easily cacheable, and the difference between caching and not caching them is the act of placing the result in the cache or not. These filters, which include the term, terms, prefix, and range filters

So combining site ID and page ID should be okay w.r.t. performance I think. And I cannot think of any other issues (keeping in mind that looking up by page ID only makes no sense, since the page ID means nothing without the site ID.)

Update:

I'd guess the downvote is mainly 1) because there are performance issues if I combine (Site-ID):(Parent-page-ID):(Page-ID) into one field, and then try to search for the page ID. However the page ID is available in the _id field, which is: (site-ID):(page-ID), so this should not be an issue. (That is, I'm not using only 1 field — I'm using 2 fields.)

The queries that correspond to Ramseykhalaf's queries would then be:

"filtered": {
  "query": {
    "match": {
      "page_content": "search phrase"
    }
  },
  "filter" : {
    "prefix" : {
      "_id" : "123:"    // site ID is "123"
    }
  }
}

And:

"filtered": {
  "query": {
    "match_all": {}
  },
  "filter": {
    "and": [{
      "prefix" : {
        "_id" : "123:"  // site ID is "123"
      }
    }, {
      "prefix": {
        "pageIdPath": "456:789:"  // section and sub section IDs are 456:789
                               // (I think I'd never search for a *subsection* only,
                               // without also knowing the parent section ID)
      }
    }]
  }
}

(I renamed sitePageIdPath to pageIdPath since site-ID is stored in _id)

A second, minor reason for the downvote might be something I didn't know until now: prefix queries are rewritten into boolean queries that match every term with the specified prefix. In my case these boolean queries could include a very large number of terms if the website has very many pages (it might) or section IDs (it won't). So is using a term query directly faster? It also cannot result in a too-many-clauses exception (see the links below).

For more info on PrefixQuery, see:
How to improve a single character PrefixQuery performance? and
With Lucene: Why do I get a Too Many Clauses error if I do a prefix search?

This to-boolean-query transformation apparently happens not only for prefix queries, but for range queries too, see e.g. Help needed figuring out reason for maxClauseCount is set to 1024 error and the Lucene BooleanQuery.TooManyClauses docs: "Thrown when an attempt is made to add more than BooleanQuery.getMaxClauseCount() clauses. This typically happens if a PrefixQuery, FuzzyQuery, WildcardQuery, or TermRangeQuery is expanded to many terms during search"
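The expansion described above can be modelled with a toy term dictionary. The Python sketch below is not Lucene code: expand_prefix and the TooManyClauses class are hypothetical stand-ins, and the limit of 1024 is taken from the error message quoted above. It only shows why a prefix over a site with many pages produces many clauses while an exact term lookup produces one.

```python
from bisect import bisect_left

MAX_CLAUSE_COUNT = 1024  # limit named in the quoted "maxClauseCount" error


class TooManyClauses(Exception):
    """Stand-in for the exception described in the quoted docs."""


def expand_prefix(sorted_terms, prefix, max_clauses=MAX_CLAUSE_COUNT):
    """Collect every indexed term that starts with `prefix`.

    In a sorted term list all such terms are contiguous, so the scan
    starts at the first term >= prefix and stops at the first mismatch.
    """
    matches = []
    for term in sorted_terms[bisect_left(sorted_terms, prefix):]:
        if not term.startswith(prefix):
            break
        matches.append(term)
        if len(matches) > max_clauses:
            raise TooManyClauses(f"prefix {prefix!r} expands past {max_clauses} terms")
    return matches


# Toy index of compound IDs: site 123 has 2000 pages, site 7 has 3.
terms = sorted(
    [f"123:{n}" for n in range(2000)] + [f"7:{n}" for n in range(3)]
)

small_site = expand_prefix(terms, "7:")  # three clauses, fine

try:
    expand_prefix(terms, "123:")  # 2000 candidate clauses
    big_site_failed = False
except TooManyClauses:
    big_site_failed = True
```

A term filter on a dedicated site ID field would look up exactly one term, no matter how many pages the site has.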

Source: https://stackoverflow.com/questions/17903024/elasticsearch-use-a-compound-tenant-id-page-id-field

Tags

ElasticSearch

multi-tenant