Indexing wikipedia with solr

半城伤御伤魂 提交于 2019-12-29 09:16:06

问题


I've installed solr 4.6.0 and follow the tutorial available at Solr's home page. Everything was fine, untill I need to do a real job that I'm about to do. I have to get a fast access to wikipedia content and I was advised to use Solr. Well, I was trying to follow the example in the link http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia, but I couldn't get the example. I am newbie, and I don't know what means data_config.xml!

<dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8" />
        <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/data/enwiki-20130102-pages-articles.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
       </entity>
        </document>
</dataConfig>

I couldn't find in the Solr home directory. Also, I tried to find some questions related to mine, How to index wikipedia files in .xml format into solr and Indexing wikipedia dump with solr, but they didn't solve my doubt.

I think I need something more basic, guiding me step by step, because the tutorial is confusing when deals with indexing wikipedia.

Any advice to give some directions to folow would be nice.


回答1:


  • For the data_config.xml

    Each Solr instance is configured using three main files:solr.xml,solrconfig.xml,schema.xml,and the data_config.xml file define the data source when you use DIH component,this URL would be usefull for you :DIH.

  • About Solr home directory

    You should start from here:https://cwiki.apache.org/confluence/display/solr/Running+Solr




回答2:


Well, I've read many things on the Web and tried to collected as many information as possible. This is how I could find the solution:

here is my solrconfig.xml:

...
  <!-- ****** Data import handler -->
  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>
...
  <lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />

Here is my data-config.xml: (important: it must be in the same folder of solrconfig.xml)

<dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8" />
        <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/Applications/solr-4.6.0/example/exampledocs/simplewikiSubSet.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
       </entity>
        </document>
</dataConfig>

Attention: The last line is very important!

My schema.xml:

...
   <field name="id"        type="string"  indexed="true" stored="true" required="true"/>
   <field name="title"     type="string"  indexed="true" stored="false"/>
   <field name="revision"  type="int"    indexed="true" stored="true"/>
   <field name="user"      type="string"  indexed="true" stored="true"/>
   <field name="userId"    type="int"     indexed="true" stored="true"/>
   <field name="text"      type="text_en"    indexed="true" stored="false"/>
   <field name="timestamp" type="date"    indexed="true" stored="true"/>
   <field name="titleText" type="text_en"    indexed="true" stored="true"/>
...
 <uniqueKey>id</uniqueKey>
...
   <copyField source="title" dest="titleText"/>
...

And it's done. That's all folks!



来源:https://stackoverflow.com/questions/20473798/indexing-wikipedia-with-solr

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!