How to filter out (broken) HTML Tags in ElasticSearch's Highlights?

Posted by 时光怂恿深爱的人放手 on 2019-12-08 06:54:48

Question


I'm having trouble with the highlighting feature of the ElasticSearch Grails plugin.

The highlights it returns contain HTML tags. That alone would not be a big problem, but it also returns broken, cut-off HTML tags.

For example: "href=google.de> Link <a"

Fragments like this can't easily be filtered out with a regex.

The solution to this seems to be a custom analyzer like this:

'{
   "index" : {
      "analysis" : {
         "analyzer" : {
            "test_1" : {
               "char_filter" : [
                  "html_strip"
               ],
               "tokenizer" : "standard"
            },
            "test_2" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "stop",
                  "asciifolding"
               ],
               "char_filter" : [
                  "html_strip"
               ],
               "tokenizer" : "standard"
            }
         }
      }
   }
}'

(Snippet taken from "HTML Strip in Elastic Search".)

The question is: how do I get the settings above into the Grails ElasticSearch plugin? (Any other solution would be welcome too.)
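To make the "any other solution" idea concrete, here is a minimal sketch of stripping markup from the text before it is indexed, so that fragments are cut from plain text rather than from raw HTML. It uses only the Python standard library and does not call Elasticsearch or the Grails plugin. The names `TagStripper`, `strip_html` and `fragment` are hypothetical, and `fragment` is only a crude stand-in for a highlighter cutting a fixed-size window. Whether this fits the plugin's indexing flow is an assumption, not something the question confirms.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text content of an HTML string, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are never stored.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


def strip_html(html):
    """Return the text of `html` with every tag removed (hypothetical helper)."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return stripper.text()


def fragment(text, start, size):
    """Crude stand-in for a highlighter: cut a fixed-size window out of the text."""
    return text[start:start + size]


raw = "See <a href=google.de> Link </a> and <a href=example.org> More </a> here"

# Cutting the raw HTML can slice through tags, as in the question's example.
broken = fragment(raw, 7, 32)

# Cutting the stripped text cannot produce tag debris, because no tags remain.
clean = fragment(strip_html(raw), 0, 32)

print(repr(broken))
print(repr(clean))
```

The stripped text keeps the whitespace that surrounded the tags, so it may need normalizing before display.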

Source: https://stackoverflow.com/questions/42004164/how-to-filter-out-broken-html-tags-in-elasticsearchs-highlights
