Solr: Can't search for numbers mixed with characters

Posted by 亡梦爱人 on 2019-12-23 11:59:34

Question


I have some items in my index (Solr 4.4) with names like Foobar 135g, where 135g refers to a weight. Searching for foobar or foobar 135 works, but when I search for the exact phrase foobar 135g, nothing is found.

I analysed the query on the "Analysis" page of the Solr admin panel, and everything looks good there: the fields are indexed correctly, the query is split correctly, and I get hits (indicated by the purple background on the tokens).

But there has to be an issue with the way I process the strings at index and/or query time. This is the field definition I'm using:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.ReverseStringFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.ReverseStringFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I'm using the two ReverseStringFilterFactory filters together with the two EdgeNGramFilterFactory filters to be able to search for foob as well as for bar or obar (strings that appear at the end of an item name). At first I thought the problem had something to do with the WordDelimiterFilterFactory and its catenateWords option. But this option doesn't affect tokens that contain numbers (am I right?).

After reading the documentation (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters), I found generateNumberParts, which defaults to 1. This leads to splitting 135g into 135 and g. But as long as I have the preserveOriginal option enabled, 135g is also indexed as a whole string. This is also shown in the Analysis panel of the admin interface.

Does anybody know what kind of filter, tokenizer... is causing this issue?

UPDATE

I've found out something interesting. When I debug the query for the search 135g, I get the following debug output:

<lst name="debug">
  <str name="rawquerystring">name_texts:135g</str>
  <str name="querystring">name_texts:135g</str>
  <str name="parsedquery">MultiPhraseQuery(name_texts:"(135g 135) (g 135g)")</str>
  <str name="parsedquery_toString">name_texts:"(135g 135) (g 135g)"</str>
  <lst name="explain"/>
  <str name="QParser">LuceneQParser</str>
  ...
</lst>

I understand that, because of the solr.WordDelimiterFilterFactory mentioned earlier, the string gets split into these parts. But why is Solr converting it into a MultiPhraseQuery? I'm a little bit confused right now; I thought that every single token generated by the solr.WordDelimiterFilterFactory at query time would trigger a separate search (or at least an OR between the tokens).

Could someone please clear this up for me? How can I avoid this?


Answer 1:


It is the WordDelimiterFilterFactory. You should be able to see it in your admin panel under Analysis. To prevent the split, add the attribute splitOnNumerics="0" to the filter.
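As a minimal sketch, this is the filter line from the question with that one attribute added; everything else is unchanged. It is assumed here that the same line replaces the filter in both the index and the query analyzer, and that the documents are reindexed afterwards, since a change to the index analyzer does not alter tokens that are already stored.

```xml
<!-- Sketch: the question's WordDelimiterFilterFactory with letter/number splitting turned off. -->
<!-- Assumption: used identically in <analyzer type="index"> and <analyzer type="query">. -->
<filter class="solr.WordDelimiterFilterFactory"
        splitOnNumerics="0"
        catenateWords="1"
        catenateAll="1"
        preserveOriginal="1"/>
```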

Update:

Read more about it here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.

solr.WordDelimiterFilterFactory

Creates solr.analysis.WordDelimiterFilter.

Splits words into subwords and performs optional transformations on subword groups. By default, words are split into subwords with the following rules:

splitOnNumerics="1" causes alphabet => number transitions to generate a new part [Solr 1.3]: "j2se" => "j" "2" "se" default is true ("1"); set to 0 to turn off
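To make the implicit settings visible, here is a sketch of the filter from the question with the relevant defaults written out. The three added attributes only restate the documented default values, so the behaviour should be the same as before. The token positions in the comment are taken from the parsedquery in the question's debug output; the labels in parentheses are my reading of which setting produces which token, not something confirmed by the output itself.

```xml
<!-- Sketch: same filter as in the question, defaults made explicit. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        splitOnNumerics="1"
        catenateWords="1"
        catenateAll="1"
        preserveOriginal="1"/>
<!--
  Query-time tokens for the input 135g, as seen in parsedquery "(135g 135) (g 135g)":
    position 1: 135g (presumably preserveOriginal), 135 (number part)
    position 2: g (word part), 135g (presumably catenateAll)
-->
```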

Update 2

Based on your latest comment, I now understand what you meant. I took your field type definition, indexed your sentence on Solr 4.5.1, and was able to search for test_mytext:"foobar 135g", test_mytext:foobar 135g, test_mytext:foobar, test_mytext:135g and test_mytext:135, where test_mytext is of the type you defined in your question above. So I do not know why you are unable to find it in your own index. Make sure your field is defined something like this: <field name="text" type="mytext" indexed="true" stored="true"/>

Update 3

Here is my debug log, using your field definition. I am not sure why you are seeing completely different processing:

Query => test_mytext:135g

"debug": {
  "rawquerystring": "test_mytext:135g",
  "querystring": "test_mytext:135g",
  "parsedquery": "test_mytext:135g test_mytext:135 test_mytext:g test_mytext:135g",
  "parsedquery_toString": "test_mytext:135g test_mytext:135 test_mytext:g test_mytext:135g",
  "explain": {
    "200": "\n0.8563627 = (MATCH) product of:\n 1.141817 = (MATCH) sum of:\n 0.35407978 = (MATCH) weight(test_mytext:135g in 1) [DefaultSimilarity], result of:\n 0.35407978 = score(doc=1,freq=2.0 = termFreq=2.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.77006286 = fieldWeight in 1, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.4336574 = (MATCH) weight(test_mytext:135 in 1) [DefaultSimilarity], result of:\n 0.4336574 = score(doc=1,freq=3.0 = termFreq=3.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.94313055 = fieldWeight in 1, product of:\n 1.7320508 = tf(freq=3.0), with freq of:\n 3.0 = termFreq=3.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.35407978 = (MATCH) weight(test_mytext:135g in 1) [DefaultSimilarity], result of:\n 0.35407978 = score(doc=1,freq=2.0 = termFreq=2.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.77006286 = fieldWeight in 1, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.75 = coord(3/4)\n"
  },

I am using Solr 4.5.1.
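One thing that might explain the different parsedquery (an assumption on my part, not something I have verified against your setup) is the autoGeneratePhraseQueries attribute of solr.TextField. When it is true, the query parser turns a single whitespace-separated chunk that analyzes into several token positions into a phrase query; when it is false, the tokens are OR'ed, as in my output above. If it is not set on the field type, the default depends on the version attribute of the schema element in schema.xml: true for schema versions below 1.4, false from 1.4 on. A sketch of setting it explicitly, with the analyzers left exactly as in the question:

```xml
<!-- Sketch: only the opening tag differs from the question's definition. -->
<!-- Assumption: an unquoted multi-token chunk such as 135g should be OR'ed, not phrase-matched. -->
<fieldType name="text" class="solr.TextField" omitNorms="false"
           autoGeneratePhraseQueries="false">
  <!-- <analyzer type="index"> and <analyzer type="query"> unchanged from the question -->
</fieldType>
```

With this in place, a quoted query such as name_texts:"135g" would still be parsed as a phrase; only unquoted input is affected.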

Update 4

Then I noticed that you are using Solr 4.4.0. I took your exact field definition and phrase, ran a query, and it finds your result.

Query => name_texts:"135g"

Result:

<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">100</str>
    <str name="name_texts">Foobar 135g</str>
    <long name="_version_">1456487722571005952</long></doc>
</result>
<lst name="debug">
  <str name="rawquerystring">name_texts:"135g"</str>
  <str name="querystring">name_texts:"135g"</str>
  <str name="parsedquery">MultiPhraseQuery(name_texts:"(135g 135) (g 135g)")</str>
  <str name="parsedquery_toString">name_texts:"(135g 135) (g 135g)"</str>

Your processing looks correct, and it finds the result in my instance. I first thought something extra in your definition was the cause, but it does not seem to cause an issue in my local instance. The best way to track down these issues is the admin Analysis page and debug queries, which you are already using. I cannot think of anything else, as I am unable to reproduce the problem. Do yourself a favor: take a clean instance of Solr, change only schema.xml to add your field definition, and index just this document through the admin panel (Documents) => {"id":"100","name_texts":"Foobar 135g"}. Then run this query: http://localhost:8983/solr/collection1/select?q=name_texts%3A%22135g%22&wt=xml&indent=true&debugQuery=true
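If you would rather send the test document in Solr's XML update format than as JSON, the equivalent message would look roughly like the sketch below. It assumes the clean schema declares an id field and a name_texts field that uses the field type from your question; the update still has to be committed before the query can see it.

```xml
<!-- Sketch: the same single test document as the JSON above, as an XML update message. -->
<add>
  <doc>
    <field name="id">100</field>
    <field name="name_texts">Foobar 135g</field>
  </doc>
</add>
```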



Source: https://stackoverflow.com/questions/20884338/solr-cant-search-for-numbers-mixed-with-characters
