SOLR Dropping Emoji Miscellaneous characters

Posted by 荒凉一梦 on 2019-12-24 21:40:31

Question


It looks like SOLR is treating characters that should be valid Unicode as invalid, and dropping them.

I "proved" this by turning on query debug to see what the parser was doing with my query. Here's an example:

Query = 'ァ☀' (\u30a1\u2600)

Here's what SOLR did with it:

'debug':{
  'rawquerystring':u'\u30a1\u2600',
  'querystring':u'\u30a1\u2600',
  'parsedquery':u'(+DisjunctionMaxQuery((text:\u30a1)))/no_coord',
  'parsedquery_toString':u'+(text:\u30a1)',

As you can see, it was OK with 'ァ', but it ate the "Black Sun" character (☀, \u2600).

I haven't tried every character in the Miscellaneous Symbols block, but I've confirmed it also doesn't like ⛿ (\u26ff) and ♖ (\u2656).

I'm using SOLR with Jetty, so the various Tomcat issues with character encoding shouldn't apply.
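
For anyone reproducing this, here is a minimal sketch of building a request that turns on query debug. The host, port, and core name (`mycore`) are hypothetical placeholders, not taken from the setup above; `debugQuery=true` is the standard Solr parameter that adds the 'debug' section to the response. The sketch also shows that the query is sent as percent-encoded UTF-8, which helps rule out an encoding problem on the client side.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; adjust to the actual deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_debug_query_url(query, base_url=SOLR_SELECT_URL):
    """Return a select URL that asks Solr to include parser debug output."""
    params = {
        "q": query,
        "debugQuery": "true",  # adds the 'debug' section to the response
        "wt": "json",
    }
    # urlencode percent-encodes non-ASCII text as UTF-8 by default.
    return base_url + "?" + urlencode(params)


url = build_debug_query_url("\u30a1\u2600")
print(url)
```

If both characters appear percent-encoded in the printed URL, they are reaching Solr intact, and the loss is happening later, during query analysis.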


Answer 1:


This very likely has more to do with the Analyzer than with character encoding. I don't see anything that specifies exactly how these characters are handled, but they are probably being treated much like punctuation by the StandardAnalyzer (or whichever Analyzer you are using), so they never make it into the final query. StandardAnalyzer splits input into tokens using the word-boundary rules set out in UAX #29, Unicode Text Segmentation.
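
As a rough illustration of why a word-oriented tokenizer might keep 'ァ' and drop '☀', the sketch below inspects the Unicode general category of each character from the question and runs a deliberately simplified tokenizer over the query. This is not Lucene's implementation: UAX #29 works from the Word_Break property rather than the general category, and `rough_word_tokens` is a hypothetical helper written only for this example.

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (
        "U+{:04X}".format(ord(ch)),
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
    )


def rough_word_tokens(text):
    """Simplified stand-in for a word tokenizer (an assumption, not UAX #29).

    Keeps runs of letters, marks, and numbers (categories L*, M*, N*)
    and discards everything else, including symbols and punctuation.
    """
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "M", "N"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


for ch in "\u30a1\u2600\u26ff\u2656":
    print(describe(ch))

tokens = rough_word_tokens("\u30a1\u2600")
print(tokens)
```

The katakana letter falls in a letter category, while the three symbols are all "So" (Symbol, other), so this simplified tokenizer returns only 'ァ', matching the parsed query in the question. If those symbols need to be searchable, the field would need an analysis chain that does not discard them (for example, a whitespace-based tokenizer).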



Source: https://stackoverflow.com/questions/19773786/solr-dropping-emoji-miscellaneous-characters
