Solr for Arabic

夙愿已清 提交于 2019-12-09 17:14:22

问题


I'm using Solr to index documents in 3 langues(arabic, french and english), I have used this fieldType :

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> 
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

Everything was good, but in arabic language when I put this request to search a word like حقل Solr doen't find the word, but when I put the word in oppositeلقح from left to right Solr find the word and return result.

Can I have result for arabic words ?


回答1:


I'm going to turn Daniel's clever analysis here to an answer for the record. Don't vote for this, just go find something of his to vote for :-)

There are two ways to get a directionality mismatch with RTL text. You can be indexing it backwards, or you can be querying it backwards. A simple HTML form querying Solr will never mess up directionality. In this care, khaled was extracting text from a PDF using a library that falls victim to the tendency of PDFs to contain 'visual-order' text rather than 'logical order'. So the index was full of backwards Arabic. To fix this, he will have to come up with a working library that extracts text from pdfs.

Forcing Apache Tika to use the latest Apache PDFbox might help, or his PDF may be so quirky that even the latest PDFBox can't handle it. In which case he has a hard problem.



来源:https://stackoverflow.com/questions/7834401/solr-for-arabic

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!