Simple query string with special characters such as ( and =


Question


This is my index

PUT /my_index
{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "filter": {
                "my_ascii_folding": {
                    "type" : "asciifolding",
                    "preserve_original": "true"
                }
            },
            "analyzer": {
                "include_special_character": {
                    "type":      "custom",
                    "filter": [
                        "lowercase",
                        "my_ascii_folding"
                    ],
                    "tokenizer": "whitespace"
                }
            }
        }
    }
}

This is my mapping:

PUT /my_index/_mapping/formulas
{
   "properties": {
      "content": {
         "type": "text",
         "analyzer": "include_special_character"
      }
   }
}

My sample data:

POST /_bulk
{"index":{"_index":"my_index","_type":"formulas"}}
{"content":"formula =IF(SUM(3;4;5))"}
{"index":{"_index":"my_index","_type":"formulas"}}
{"content":"some if words: dif difuse"}

With this query I'd like to get back just the record containing the formula ("formula =IF(SUM(3;4;5))"), but it returns both documents.

GET /my_index/_search
{
  "query": {
    "simple_query_string" : {
        "query": "if(",
        "analyzer": "include_special_character",
        "fields": ["_all"]
    }
  }
}

And this query does not return the record with the formula.

GET /my_index/_search
{
  "query": {
    "simple_query_string" : {
        "query": "=if(",
        "analyzer": "include_special_character",
        "fields": ["_all"]
    }
  }
}

How can I fix both queries to return what I expect?

Thanks


Answer 1:


First off, I want to say thank you for including all the requests needed to reproduce the data set you're working against locally. That makes it much easier to dig into an answer.

There are some rather interesting things happening here. The first thing I want to point out is what's actually happening with your queries when you're using the _all field, because there is some subtle behavior that can very easily cause confusion.

I'm going to rely on the _analyze endpoint to try to help point out what's going on here.

To begin, here is a request that shows how your document text will be interpreted against the "content" field:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "formula =IF(SUM(3;4;5))"
  ],
  "field": "content"
}

And the results:

{
  "tokens": [
    {
      "token": "formula",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "=if(sum(3;4;5))",
      "start_offset": 8,
      "end_offset": 23,
      "type": "word",
      "position": 1
    }
  ]
}

So far, so good; this is probably what you're expecting to see. If you want really verbose output of what's occurring, add the following to the analyze request:

explain: true
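For example, the same request with the flag added (I'll omit the output here, as it's lengthy):

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "formula =IF(SUM(3;4;5))"
  ],
  "field": "content",
  "explain": true
}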

Now, if you remove the "analyzer" value from the analyze request, the token output will remain the same. That's because we were merely overriding the analyzer with the one the field is already set to use; without it, we fall back on the analyzer specified in the mapping of the field we're analyzing against.

To prove that, I'll analyze against a field that has no mapping on the index you provided, specifying the analyzer in one request and omitting it in the other.

In:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "formula =IF(SUM(3;4;5))"
  ],
  "field": "test"
}

Out:

{
  "tokens": [
    {
      "token": "formula",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "=if(sum(3;4;5))",
      "start_offset": 8,
      "end_offset": 23,
      "type": "word",
      "position": 1
    }
  ]
}

Now without the analyzer specified. In:

GET my_index/_analyze
{
  "text": [
    "formula =IF(SUM(3;4;5))"
  ],
  "field": "test"
}

Out:

{
  "tokens": [
    {
      "token": "formula",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "if",
      "start_offset": 9,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "sum",
      "start_offset": 12,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "3;4;5",
      "start_offset": 16,
      "end_offset": 21,
      "type": "<NUM>",
      "position": 3
    }
  ]
}

In the second example, it falls back on the default (standard) analyzer and interprets the input that way, because there is no mapping for a field called "test".

Now for some background on the "_all" field and why you're getting unexpected results. According to the documentation, "_all" is a special field that exists unless explicitly disabled, and it is always treated as a "text" field:

The _all field is just a text field, and accepts the same parameters that other string fields accept, including analyzer, term_vectors, index_options, and store.
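Since _all "is just a text field," one possible workaround is to give it your custom analyzer at index-creation time, which makes _all behave like your "content" field. Note this alone still wouldn't give you partial matches like "=if(" (for that, see the n-gram approach below). A sketch I haven't tested, with my_index_all as a hypothetical index name, reusing your analysis settings:

PUT /my_index_all
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": "true"
        }
      },
      "analyzer": {
        "include_special_character": {
          "type": "custom",
          "filter": [
            "lowercase",
            "my_ascii_folding"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "formulas": {
      "_all": {
        "analyzer": "include_special_character"
      }
    }
  }
}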

For completeness, here is how your other document is analyzed when indexed.

In:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "some if words: dif difuse"
  ],
  "field": "content"
}

Out:

{
  "tokens": [
    {
      "token": "some",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "if",
      "start_offset": 5,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "words:",
      "start_offset": 8,
      "end_offset": 14,
      "type": "word",
      "position": 2
    },
    {
      "token": "dif",
      "start_offset": 15,
      "end_offset": 18,
      "type": "word",
      "position": 3
    },
    {
      "token": "difuse",
      "start_offset": 19,
      "end_offset": 25,
      "type": "word",
      "position": 4
    }
  ]
}

Now we have the background on why the analyzer behaves the way it does for mapped fields, and we know "_all" is logically a field that is already mapped as text. It turns out that when analyzing against "_all", the specified analyzer is ignored, disallowing the override that worked above. The results of the following are hopefully less surprising now.

In:

GET my_index/_analyze
{
  "analyzer": "include_special_character",
  "text": [
    "=if("
  ],
  "field": "_all"
}

Out:

{
  "tokens": [
    {
      "token": "if",
      "start_offset": 1,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

In the example above, regardless of what analyzer I specify, the "_all" field is treated as a mapped text field, so the analyzer associated with it (the default) is used.

Now, when you search the "_all" field, you get hits because both the indexing side and the search side produce a term for "if". When you use _all, both your indexed text and your query text go through the default analyzer, not the one you specified, so the token "if" is present both in each document's "_all" field and in your query text.

The most interesting part to me is that "=if(" returns no hits at all. I would normally expect it to behave exactly like "if" or "if(" in this scenario, since everything but the "if" portion is thrown out by the default analyzer. I believe the missing hit is related to how the query string is parsed because of the "=" character. I tried to research what that equals character does exactly, but I didn't find good documentation beyond it being part of the Lucene syntax. I don't think knowing what's happening with it is essential to your question, but I'd be curious if anyone here could shed some light on it.
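If you want to poke at this yourself, the validate API can show how the query string gets rewritten into a Lucene query (a diagnostic sketch; the exact explanation output varies by version):

GET /my_index/_validate/query?explain=true
{
  "query": {
    "simple_query_string": {
      "query": "=if(",
      "fields": ["_all"]
    }
  }
}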

Stepping away from "simple_query_string", I did see both documents returned by either of the following match queries...

With equal:

GET /my_index/_search
{
  "query": {
    "match": {
      "_all": "=if("
    }
  }
}

Without equal:

GET /my_index/_search
{
  "query": {
    "match": {
      "_all": "if("
    }
  }
}

So now, with all the above exploration laid out, here are some thoughts on potential approaches to your problem.

Here are the tokens for the document we want to return hits on...

In:

GET my_index/formulas/AV9GIDTggkgblFY6zpKT/_termvectors?fields=content

Out:

{
  "_index": "my_index",
  "_type": "formulas",
  "_id": "AV9GIDTggkgblFY6zpKT",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "content": {
      "field_statistics": {
        "sum_doc_freq": 7,
        "doc_count": 2,
        "sum_ttf": 7
      },
      "terms": {
        "=if(sum(3;4;5))": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 8,
              "end_offset": 23
            }
          ]
        },
        "formula": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 7
            }
          ]
        }
      }
    }
  }
}

Because of the above, if we change your queries from "_all" to "content", you will only get a hit on the document we're interested in with one of the two tokens in the response above: "=if(sum(3;4;5))" or "formula". While this is becoming more accurate, I don't think it accomplishes your goal.
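For instance, this query against "content" (using one of those whole tokens) should return just the formula document:

GET /my_index/_search
{
  "query": {
    "simple_query_string": {
      "query": "formula",
      "fields": ["content"]
    }
  }
}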

Another approach I considered, based on the requirements, would be a keyword mapping. However, that would be even more restrictive than the example above, since each "content" value would be indexed as exactly one token: the entirety of its value (a quick sketch follows). I believe the best fit for your problem is to add an n-gram tokenizer to your mapping.
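For reference, that keyword alternative would look roughly like this (my_index_kw is a hypothetical index name), and only whole-value matches would ever hit:

PUT /my_index_kw
{
  "mappings": {
    "formulas": {
      "properties": {
        "content": {
          "type": "keyword"
        }
      }
    }
  }
}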

Here is the series of requests I would use to tackle this problem:

Index settings:

PUT /my_index2
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": "true"
        }
      },
      "analyzer": {
        "include_special_character_gram": {
          "type": "custom",
          "filter": [
            "lowercase",
            "my_ascii_folding"
          ],
          "tokenizer": "ngram_tokenizer"
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      }
    }
  }
}

Map:

PUT /my_index2/_mapping/formulas
{
   "properties": {
      "content": {
         "type": "text",
         "analyzer": "include_special_character_gram"
      }
   }
}

Add docs:

POST /_bulk
{"index":{"_index":"my_index2","_type":"formulas"}}
{"content":"formula =IF(SUM(3;4;5))"}
{"index":{"_index":"my_index2","_type":"formulas"}}
{"content":"some if words: dif difuse"}

Term vectors of the first doc:

GET my_index2/formulas/AV9GZ3sSgkgblFY6zpK2/_termvectors?fields=content

Out:

{
  "_index": "my_index2",
  "_type": "formulas",
  "_id": "AV9GZ3sSgkgblFY6zpK2",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "content": {
      "field_statistics": {
        "sum_doc_freq": 102,
        "doc_count": 2,
        "sum_ttf": 106
      },
      "terms": {
        "(3": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 46,
              "start_offset": 15,
              "end_offset": 17
            }
          ]
        },
        "(3;": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 47,
              "start_offset": 15,
              "end_offset": 18
            }
          ]
        },
... Omitting the rest because of max response lengths.
    }
  }
}

Now let's wrap this example up... Here is the query I used previously that was returning both of your entries, and continues to do the same here.

In:

GET /my_index2/_search
{
  "query": {
    "match": {
      "content": "=if("
    }
  }
}

Out:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 2.9511943,
    "hits": [
      {
        "_index": "my_index2",
        "_type": "formulas",
        "_id": "AV9GZ3sSgkgblFY6zpK2",
        "_score": 2.9511943,
        "_source": {
          "content": "formula =IF(SUM(3;4;5))"
        }
      },
      {
        "_index": "my_index2",
        "_type": "formulas",
        "_id": "AV9GZ3sSgkgblFY6zpK3",
        "_score": 0.30116585,
        "_source": {
          "content": "some if words: dif difuse"
        }
      }
    ]
  }
}

So we see the same results, but why is this happening? In the query above, we're now applying the same n-gram analyzer to the input text, meaning both documents will still have matching tokens!

In:

GET my_index2/_analyze
{
  "analyzer": "include_special_character_gram",
  "text": [
    "=if("
  ]
}

Out:

{
  "tokens": [
    {
      "token": "=i",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "=if",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "=if(",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "if",
      "start_offset": 1,
      "end_offset": 3,
      "type": "word",
      "position": 3
    },
    {
      "token": "if(",
      "start_offset": 1,
      "end_offset": 4,
      "type": "word",
      "position": 4
    },
    {
      "token": "f(",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 5
    }
  ]
}

Those are the tokens your query text generates under the n-gram analyzer, which is why both documents matched. The key ingredient is to specify "keyword" as the analyzer on the query side, so the entire query value becomes a single token that can match one of your indexed term vectors; in other words, we use a different analyzer for the query than for the field.

In:

GET my_index2/_analyze
{
  "analyzer": "keyword",
  "text": [
    "=if("
  ]
}

Out:

{
  "tokens": [
    {
      "token": "=if(",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}

Let's see if it works...

In:

GET /my_index2/_search
{
  "query": {
    "match": {
      "content": {
        "query": "=if(",
        "analyzer": "keyword"
      }
    }
  }
}

Out:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.56074005,
    "hits": [
      {
        "_index": "my_index2",
        "_type": "formulas",
        "_id": "AV9GZ3sSgkgblFY6zpK2",
        "_score": 0.56074005,
        "_source": {
          "content": "formula =IF(SUM(3;4;5))"
        }
      }
    ]
  }
}

So based on the above, you can see how it works when we explicitly specify the keyword analyzer as the search analyzer against the n-gram-analyzed field we have stored. Here is an update we can apply to the mapping that will simplify our requests. (Note: you will want to either destroy and recreate the existing index with this mapping, or otherwise reindex; a sketch of the recreate route follows the mapping below.)

PUT /my_index2/_mapping/formulas
{
   "properties": {
      "content": {
         "type": "text",
         "analyzer": "include_special_character_gram",
         "search_analyzer": "keyword"

      }
   }
}
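If you take the destroy-and-recreate route, that means something like the following, then re-running the index settings, this mapping, and the bulk requests from above:

DELETE /my_index2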

Now let's go back to the match query I used initially, the one that returned both docs.

In:

GET /my_index2/_search
{
  "query": {
    "match": {
      "content": "=if("
    }
  }
}

Out:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.56074005,
    "hits": [
      {
        "_index": "my_index2",
        "_type": "formulas",
        "_id": "AV9GZ3sSgkgblFY6zpK2",
        "_score": 0.56074005,
        "_source": {
          "content": "formula =IF(SUM(3;4;5))"
        }
      }
    ]
  }
}

Edit - the same query via simple_query_string. Note that "(" is a reserved character in the simple query string syntax, so it must be escaped (the doubled backslash below is JSON string escaping).

In:

GET /my_index2/_search
{
  "query": {
    "simple_query_string": {
      "query": "=if\\(",
      "fields": ["content"]
    }
  }
}

Out:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.56074005,
    "hits": [
      {
        "_index": "my_index2",
        "_type": "formulas",
        "_id": "AV9GZ3sSgkgblFY6zpK2",
        "_score": 0.56074005,
        "_source": {
          "content": "formula =IF(SUM(3;4;5))"
        }
      }
    ]
  }
}

And there you have it. You can obviously tune the n-gram sizes if you choose to go this route (one variation is sketched below). This answer is already long-winded enough, so I won't walk through other possible approaches, but I figured one working solution would be helpful. I think the important thing here is to understand what's going on behind the scenes with the _all field and with the interpretation of your query string.
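For example, if you also needed hits on longer literal fragments such as "=if(su", widening the gram range would index those tokens too. A sketch (my_index3 is a hypothetical index name; larger max_gram values inflate the index, and newer Elasticsearch versions cap the min/max spread via index.max_ngram_diff):

PUT /my_index3
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": "true"
        }
      },
      "analyzer": {
        "include_special_character_gram": {
          "type": "custom",
          "filter": [
            "lowercase",
            "my_ascii_folding"
          ],
          "tokenizer": "ngram_tokenizer"
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 6,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      }
    }
  }
}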

Hope this helps and thanks for the interesting question.



Source: https://stackoverflow.com/questions/46877483/simple-query-string-with-special-characters-such-as-and
