Lucene foreign chars problem

Posted by 孤人 on 2019-12-07 08:44:38

Question


I'm having some serious issues using Zend_Lucene and foreign characters like åäö. These issues appear both when the index is created and when it's queried. I've tried both iso-8859-1 and utf-8.

ISO-8859-1

The query that doesn't work looks like "+_area:skåne". With Zend_Lucene I get no matches, but if I run the same query in Luke I get many matching documents.

The index contains 20 fields. The "_area" field is added with the following syntax:

$doc->addField(Zend_Search_Lucene_Field::keyword('_area', strtolower($item['area']), 'iso-8859-1')); 

I am using the Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive analyzer.

During indexing, the notice below appeared intermittently (the indexed documents were selected at random from a database that uses ISO-8859-1 encoding):

Notice: iconv(): Detected an illegal character in input string in TextNum.php.

This was "solved" by checking if $this->_input is empty, as it seemed that this caused the notices. Note: The weird query results were a pre-existing condition.

When I search keyword fields using foreign characters I get the notice above, but searching text fields behaves differently: it generates about a hundred copies of the notice below.

Notice: Undefined offset: 1996 in \Zend\Search\Lucene\Search\Query\MultiTerm.php on line 472

But it produces what looks like a correct result set! On a side note, this second query doesn't generate any results in Luke.

UTF-8

I've also tried UTF-8 because, to my knowledge, Zend_Lucene uses it internally. Since the data set is ISO-8859-1, I convert it using utf8_encode. But the indexing produces the following errors.

Notice: Undefined offset: 266979 in \Zend\Search\Lucene\Index\SegmentInfo.php on line 632

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentMerger.php on line 196

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentMerger.php on line 200

Notice: Undefined index: in \Zend\Search\Lucene\Index\SegmentWriter.php on line 231

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentWriter.php on line 231

Notice: Undefined offset: 250595 in \Zend\Search\Lucene\Index\SegmentInfo.php on line 2020

Notice: Trying to get property of non-object in \Zend\Search\Lucene\Index\SegmentInfo.php on line 2020

Notice: Undefined index: in \Zend\Search\Lucene\Index\SegmentWriter.php on line 465 ...


So. Can someone please shed some light? :) I believe (after days of googling) that I'm not the only one experiencing this.


Answer 1:


I suggest you try using a UTF-8 compatible text analyzer. It looks like the analyzer you are using destroys the non-ASCII characters. You should make sure that the text is input properly, and that it reaches Lucene in the proper format.
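
As a rough illustration of that suggestion, the sketch below switches to the UTF-8 aware analyzer that ships with Zend Framework 1, sets the query parser's default encoding, and converts the ISO-8859-1 source value to UTF-8 before indexing. It assumes Zend Framework 1.8 or later on the include path, the mbstring extension, and PCRE built with UTF-8 support (which the Utf8Num analyzers require). The index path and the $item array are placeholders, and the source file itself must be saved as UTF-8. It is not confirmed to resolve the notices described in the question.

```php
<?php
// Assumption: Zend Framework 1.8+ is on the include path.
require_once 'Zend/Loader/Autoloader.php';
Zend_Loader_Autoloader::getInstance();

// Use a UTF-8 aware, case-insensitive analyzer for indexing and searching.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

// Tell the query parser that incoming query strings are UTF-8.
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');

// Placeholder row; in the question this comes from an ISO-8859-1 database.
$item = array('area' => "Sk\xE5ne"); // "Skåne" encoded as ISO-8859-1

// Convert to UTF-8 first, then lowercase with a multibyte-safe function.
// (strtolower() works byte by byte and is not reliable for non-ASCII letters.)
$area = mb_convert_encoding($item['area'], 'UTF-8', 'ISO-8859-1');
$area = mb_strtolower($area, 'UTF-8');

// Placeholder index location.
$index = Zend_Search_Lucene::create('/path/to/index');

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::keyword('_area', $area, 'utf-8'));
$index->addDocument($doc);
$index->commit();

// The query string is UTF-8 because this file is saved as UTF-8.
$hits = $index->find('+_area:skåne');
echo count($hits), " hit(s)\n";
```

The same analyzer has to be set as the default both when the index is built and when it is queried; an index created with a different analyzer would need to be rebuilt.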



Source: https://stackoverflow.com/questions/1158139/lucene-foreign-chars-problem
