solr

Solr Installation and Configuration

Submitted by 坚强是说给别人听的谎言 on 2019-12-19 00:24:33
  Solr is a high-performance full-text search server, developed in Java 5 and built on Lucene (Lucene focuses on the low-level search machinery, while Solr targets enterprise applications; Lucene does not provide the management layer a search service needs). Solr extends Lucene with a richer query language, configurability, extensibility, and query-performance optimizations, and ships with a complete administration UI, making it an excellent full-text search engine.

  It exposes a web-service-like API over HTTP: clients can submit XML documents in a given format to the search server to build the index, and can issue HTTP GET requests to search, receiving results back as XML.

  Installation and configuration:

  1. Download the Solr zip and unpack it; rename the war file in the dist directory to solr.war and copy it into Tomcat's webapps directory.

  2. Set the Solr home. The simplest approach is to configure a JNDI entry java:comp/env/solr/home in Tomcat that points to the Solr home directory (under the example directory). Create the file <Tomcat install dir>/conf/Catalina/localhost/solr.xml:

  solr.xml <Context docBase="C:/tools/tomcat6.0/webapps/solr.war" debug="0" crossContext="true" >
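For reference, a complete version of that context file typically adds an Environment entry for solr/home inside the Context element. This is a sketch: the paths follow the ones in the post and will differ on your machine.

```xml
<Context docBase="C:/tools/tomcat6.0/webapps/solr.war" debug="0" crossContext="true">
  <!-- JNDI entry exposed to the webapp as java:comp/env/solr/home -->
  <Environment name="solr/home" type="java.lang.String"
               value="C:/tools/solr/example/solr" override="true"/>
</Context>
```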

How to optimize solr index

Submitted by 强颜欢笑 on 2019-12-18 19:07:04
Question: How do I optimize a Solr index? I tried changing settings in solrconfig.xml and documents do get indexed, but I want to know how to verify that the index is optimized, and what is involved in index optimization. Answer 1: I find this to be the easiest way to optimize a Solr index. In my context, "optimize" means merging all index segments: curl http://localhost:8983/solr/<core_name>/update -F stream.body=' <optimize />' Answer 2: Check the size of the respective core before you start. Open
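The same optimize call can be issued from code rather than curl. A minimal sketch with the Python standard library; the host, port, and core name are placeholders to adjust for your install.

```python
import urllib.request

def optimize_request(core, host="http://localhost:8983"):
    """Build a POST request asking Solr to merge all index segments.

    Host/port and core name are placeholders; adjust for your install.
    """
    url = f"{host}/solr/{core}/update"
    # <optimize/> is the same message the curl command above sends.
    return urllib.request.Request(
        url, data=b"<optimize/>", headers={"Content-Type": "text/xml"}
    )

# To actually send it (requires a running Solr):
# urllib.request.urlopen(optimize_request("my_core"))
```

One quick way to verify the result is to compare the core's index directory before and after: a fully optimized index collapses to a single segment, so the file count drops.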

solr: installing VMware Tools on CentOS 7 (kernel headers issue)

Submitted by 筅森魡賤 on 2019-12-18 17:57:31
Run "mkdir /mnt/cdrom" to create a directory named cdrom under /mnt:

mkdir /mnt/cdrom

Run "mount -t iso9660 /dev/cdrom /mnt/cdrom" to mount the CD-ROM under /mnt/cdrom:

mount -t iso9660 /dev/cdrom /mnt/cdrom

Run "ls /mnt/cdrom/" to inspect the contents, then copy the archive into /tmp:

cp /mnt/cdrom/VMwareTools-10.1.6-5214329.tar.gz /tmp

Run "cd /tmp" to enter the tmp directory and "ls" to list its files; you should see the VMwareTools-10.1.6-5214329.tar.gz archive. Extract it:

tar -xzf VMwareTools-10.1.6-5214329.tar.gz

Run "ls" again; a new directory named "vmware-tools-distrib" has appeared. Enter it with "cd vmware-tools-distrib" and run "./vmware

Search for short words with SOLR

Submitted by 江枫思渺然 on 2019-12-18 16:58:41
Question: I am using Solr with NGramTokenizerFactory to create search tokens for substrings of words. The NGramTokenizer is configured with a minimum gram length of 3. This means I can search for e.g. "unb" and match the word "unbelievable". However, I have a problem with short words like "I" and "in": these are not indexed by Solr (I suspect because of the NGramTokenizer), and therefore I cannot search for them. I don't want to reduce the minimum length to 1 or 2, since this
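For reference, a schema.xml fieldType of the kind the question describes might look like this. This is a sketch, not the asker's actual config; the gram sizes and filters are assumptions.

```xml
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- minGramSize="3" is what makes words shorter than 3 chars disappear -->
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A common workaround is to copyField the same text into a second, plainly tokenized field and query both, so that short words survive in at least one field without lowering the gram size.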

Index pdf documents in Solr from C# client

Submitted by 扶醉桌前 on 2019-12-18 15:55:21
Question: Basically I'm trying to index Word or PDF documents in Solr. I found the ExtractingRequestHandler, but I can't figure out how to write C# code that performs the HTTP POST request shown in the Solr wiki: http://wiki.apache.org/solr/ExtractingRequestHandler. I've installed Solr 3.4 on Tomcat 7 (7.0.22) using the files from the example/solr directory in the Solr zip, and I haven't altered anything. The ExtractingRequestHandler should be configured out of the box in solrconfig.xml and ready
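The question asks for C#, but the request itself is plain HTTP, so a language-agnostic sketch of its shape may help; this Python version mirrors what a C# client (e.g. WebClient or HttpWebRequest) would have to reproduce. The base URL is an assumption for a Solr-on-Tomcat install, and literal.id/commit are the parameters the wiki page describes.

```python
import urllib.parse
import urllib.request

def extract_request(pdf_bytes, doc_id, base="http://localhost:8080/solr"):
    """Build the POST that sends a raw PDF body to /update/extract.

    `base` is an assumed Solr-on-Tomcat URL; `literal.id` becomes the id
    of the document created from the extracted content.
    """
    params = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    url = f"{base}/update/extract?{params}"
    return urllib.request.Request(
        url, data=pdf_bytes, headers={"Content-Type": "application/pdf"}
    )

# To actually send it (requires a running Solr):
# urllib.request.urlopen(extract_request(open("doc.pdf", "rb").read(), "doc1"))
```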

Recrawl URL with Nutch just for updated sites

Submitted by 社会主义新天地 on 2019-12-18 15:52:43
Question: I crawled one URL with Nutch 2.1, and now I want to re-crawl pages after they are updated. How can I do this? How can I know that a page has been updated? Answer 1: Simply put, you can't. You need to recrawl a page to check whether it has been updated. So, according to your needs, prioritize the pages/domains and recrawl them on a schedule; for that you need a job scheduler such as Quartz. You need to write a function that compares the pages. However, Nutch originally saves the pages as index files. In other
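The comparison function the answer mentions is essentially what Nutch's pluggable Signature implementations (e.g. MD5Signature) do: store a digest of each fetched page and compare it on the next fetch. A purely illustrative Python sketch of that idea:

```python
import hashlib

def page_signature(content: str) -> str:
    """Digest of the fetched page; store it alongside the URL at crawl time."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()

def needs_reindex(stored_signature: str, fresh_content: str) -> bool:
    """On the next scheduled fetch, compare signatures to detect a change."""
    return page_signature(fresh_content) != stored_signature
```

Note that a raw hash flags any byte-level difference, including ads and timestamps; Nutch's TextProfileSignature exists precisely to ignore such noise.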

How does Solr sort by default when using filter query *:*?

Submitted by 拈花ヽ惹草 on 2019-12-18 15:16:24
Question: We currently have a page set up with no filters/facets/queries applied; it is a listing of all content (using a pager). The filter query is something like *:* (match anything in any field). I can't figure out how the content is being sorted, though. It says it's sorted by "relevancy", but what does that mean when you're selecting everything? In some quick testing, it does not appear to be sorted by the date the content was modified or entered into the index. Answer 1: Querying for *:* is also called a
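With q=*:* every document receives the same constant score, so a "relevancy" sort is an all-way tie and results generally come back in internal docid order, which is why the ordering looks arbitrary. To make the listing deterministic, pass an explicit sort parameter; a small sketch of building such a query string (the field name last_modified is an assumption, not from the original post):

```python
from urllib.parse import urlencode

# An explicit sort pins down the ordering; without it, *:* queries fall
# back to constant-score "relevance", i.e. whatever order ties resolve to.
params = urlencode({"q": "*:*", "sort": "last_modified desc", "rows": 10})
query_string = "/solr/select?" + params
print(query_string)
```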

Document search on partial words

Submitted by 感情迁移 on 2019-12-18 15:16:24
Question: I am looking for a document search engine (like Xapian, Whoosh, Lucene, Solr, Sphinx, or others) that is capable of searching partial terms. For example, when searching for the term "brit" the search engine should return documents containing either "britney" or "britain", or in general any document containing a word matching the pattern *brit*. Tangentially, I noticed most engines use TF-IDF (term frequency-inverse document frequency) or its derivatives, which are based on full terms and not partial terms.
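In Solr, this kind of infix matching is commonly handled by indexing n-grams of each token, so that "brit" is stored as one of the grams of "britney" and "britain" and matches as an ordinary term. A sketch of such a schema.xml fieldType; the gram sizes are assumptions and trade index size against the shortest searchable fragment.

```xml
<fieldType name="text_substring" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "britney" is indexed as bri, rit, ..., brit, ... so *brit* queries
         reduce to an exact match on the gram "brit" -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```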