lucene

Search for short words with SOLR

Submitted by 江枫思渺然 on 2019-12-18 16:58:41
Question: I am using Solr along with NGramTokenizerFactory to create search tokens for substrings of words. The NGramTokenizer is configured with a minimum gram size of 3, which means I can search for e.g. "unb" and match the word "unbelievable". However, I have a problem with short words like "I" and "in": these are not indexed by Solr (I suspect because of the NGramTokenizer), and therefore I cannot search for them. I don't want to reduce the minimum gram size to 1 or 2, since this…
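A common workaround, shown as an editor-added sketch rather than anything from the original question, is to keep the n-gram field for substring search and add a second field with a plain tokenizer so that short words survive intact; all field and type names below are illustrative:

    <!-- Hypothetical schema.xml excerpt; field/type names are illustrative. -->
    <fieldType name="text_ngram" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="text_plain" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="body_ngram" type="text_ngram" indexed="true" stored="false"/>
    <field name="body_plain" type="text_plain" indexed="true" stored="false"/>
    <copyField source="body" dest="body_ngram"/>
    <copyField source="body" dest="body_plain"/>

Queries can then target both fields (body_ngram OR body_plain), so "unb" still matches "unbelievable" via n-grams while "in" matches exactly via the plain field.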

Multiple Field Query handling in Lucene

Submitted by 和自甴很熟 on 2019-12-18 16:52:32
Question: I have written an index searcher in Lucene that searches multiple fields in the indexed database. It takes the query as two strings: one is, say, the title and the other is the city name. The index has three fields: title, address, and city. A hit should occur only if both the title and the city name match. For that purpose I have written the following searcher code using MultiFieldQueryParser, with the help of a post: public void searchdb(String myQuery, String myCity) throws…
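The usual way to require both fields to match is a BooleanQuery with two MUST clauses. The sketch below assumes Lucene 3.x signatures and the field names from the question; it is an editor-added illustration, not the asker's original code:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.util.Version;

    public class TitleCitySearcher {
        // Both clauses are MUST, so a document is a hit only when the
        // title matches myQuery AND the city field matches myCity.
        public static TopDocs searchdb(IndexSearcher searcher, String myQuery, String myCity)
                throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
            QueryParser titleParser = new QueryParser(Version.LUCENE_36, "title", analyzer);
            QueryParser cityParser = new QueryParser(Version.LUCENE_36, "city", analyzer);

            BooleanQuery combined = new BooleanQuery();
            combined.add(titleParser.parse(myQuery), BooleanClause.Occur.MUST);
            combined.add(cityParser.parse(myCity), BooleanClause.Occur.MUST);

            return searcher.search(combined, 10);
        }
    }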

Full-Text Search Technology: An Introduction to Inverted Indexes

Submitted by  ̄綄美尐妖づ on 2019-12-18 15:56:21

Contents: 1. Introduction; 2. Detailed description

1. Introduction

The inverted index arises from a practical need: looking up records by the value of an attribute. Each entry in such an index contains an attribute value together with the addresses of all records that have that value. Because the position of a record is determined from the attribute value, rather than the attribute value from the record, it is called an inverted index. A file that carries an inverted index is called an inverted-index file, or inverted file for short.

In an inverted file (inverted index), the indexed objects are the words of a document or document collection, and the index stores the positions at which those words occur within a document or group of documents. It is the most commonly used indexing mechanism for documents and document collections.

Building the inverted index is the key step for a search engine. An inverted index is generally represented as a keyword followed by its frequency (number of occurrences) and its positions (which article or web page it appears in, along with related information such as date and author). It effectively builds an index for the hundreds of billions of pages on the internet, much like the table of contents or tabs of a book: a reader who wants the chapter on a given topic can go straight to the relevant pages from the table of contents, without scanning the book page by page from first to last.

2. Detailed description

0) Suppose there are two articles, 1 and 2.
   Article 1: Tom lives in Guangzhou, I live in Guangzhou too.
   Article 2: He once lived in Shanghai.
1) Since Lucene indexes and queries by keyword, we first have to extract the keywords of the two articles (a minimal sketch of the resulting structure follows this list), which usually requires the following processing steps:
   a…
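To make the mechanism concrete, here is a minimal editor-added sketch (not from the article) of an inverted index in Java, mapping each term to the documents and word positions where it occurs, using the two example articles above:

    import java.util.*;

    // Minimal inverted index: term -> (docId -> list of word positions).
    public class InvertedIndexDemo {
        private final Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();

        public void addDocument(int docId, String text) {
            // Naive tokenization: lowercase, split on non-letter characters.
            String[] tokens = text.toLowerCase().split("[^a-z]+");
            for (int pos = 0; pos < tokens.length; pos++) {
                if (tokens[pos].isEmpty()) continue;
                index.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                     .computeIfAbsent(docId, d -> new ArrayList<>())
                     .add(pos);
            }
        }

        public Map<Integer, List<Integer>> postings(String term) {
            return index.getOrDefault(term.toLowerCase(), Collections.emptyMap());
        }

        public static void main(String[] args) {
            InvertedIndexDemo idx = new InvertedIndexDemo();
            idx.addDocument(1, "Tom lives in Guangzhou, I live in Guangzhou too.");
            idx.addDocument(2, "He once lived in Shanghai.");
            System.out.println(idx.postings("guangzhou")); // {1=[3, 7]}
            System.out.println(idx.postings("in"));        // {1=[2, 6], 2=[3]}
        }
    }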

Recrawl URL with Nutch just for updated sites

Submitted by 社会主义新天地 on 2019-12-18 15:52:43
Question: I crawled one URL with Nutch 2.1, and now I want to re-crawl pages after they get updated. How can I do this? How can I know that a page has been updated?

Answer 1: Simply put, you can't. You need to recrawl a page to check whether it has been updated. So, according to your needs, prioritize the pages/domains and recrawl them periodically; for that you need a job scheduler such as Quartz. You also need to write a function that compares the pages. However, Nutch originally saves the pages as index files. In other…
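One simple shape for that page-comparison function, an editor-added sketch of the idea rather than anything from the Nutch API, is to hash the fetched content and compare digests across crawls:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    // Detect page changes by comparing MD5 digests of fetched content.
    // Persisting digests between crawls (here an in-memory map) is up to you.
    public class PageChangeDetector {
        private final Map<String, String> lastDigest = new HashMap<>();

        /** Returns true if the content differs from the previously recorded crawl. */
        public boolean hasChanged(String url, String content) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] hash = md5.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) hex.append(String.format("%02x", b));
            String previous = lastDigest.put(url, hex.toString());
            return previous == null || !previous.equals(hex.toString());
        }
    }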

Document search on partial words

Submitted by 感情迁移 on 2019-12-18 15:16:24
Question: I am looking for a document search engine (like Xapian, Whoosh, Lucene, Solr, Sphinx, or others) that is capable of searching partial terms. For example, when searching for the term "brit", the search engine should return documents containing either "britney" or "britain", or in general any document containing a word matching the pattern *brit*. Tangentially, I noticed most engines use TF-IDF (term frequency-inverse document frequency) or its derivatives, which are based on full terms and not partial terms.
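In Lucene specifically, one way to match partial terms is a WildcardQuery; the sketch below is editor-added, with an assumed field name of "content". Note that a leading wildcard forces a scan of the term dictionary, so on large indexes an n-gram field is the usual faster alternative:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.WildcardQuery;

    public class PartialTermSearch {
        // Matches any term containing "brit" in the "content" field,
        // e.g. "britney", "britain", "unbritish".
        public static TopDocs searchPartial(IndexSearcher searcher) throws Exception {
            WildcardQuery query = new WildcardQuery(new Term("content", "*brit*"));
            return searcher.search(query, 10);
        }
    }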

How to know if a geo coordinate lies within a geo polygon in elasticsearch?

Submitted by 北战南征 on 2019-12-18 13:38:26
Question: I am using Elasticsearch 1.4.1-1.4.4. I am trying to index a geo-polygon shape (document) into my index, and now that the shape is indexed I want to know whether a geo coordinate lies within the boundaries of that particular indexed geo-polygon shape.

GET /city/_search
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "geo_polygon": {
          "location": {
            "points": [
              [72.776491, 19.259634],
              [72.955705, 19.268060],
              [72.945406, 19.189611],
              [72.987291, 19.169507],
              [72.963945, 19…
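For the reverse lookup, testing a point against polygons stored in the index, the usual Elasticsearch 1.x approach is a geo_shape filter with a point shape. This is an editor-added sketch: the "boundary" field name is illustrative, and the field must be mapped as geo_shape rather than geo_point:

    GET /city/_search
    {
      "query": {
        "filtered": {
          "query": { "match_all": {} },
          "filter": {
            "geo_shape": {
              "boundary": {
                "shape": { "type": "point", "coordinates": [72.9, 19.2] },
                "relation": "intersects"
              }
            }
          }
        }
      }
    }

With "relation": "intersects", a document matches when the point falls inside (or on the edge of) its indexed polygon.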

How to solve the 'Lock obtain timed out' when using Solr plainly?

Submitted by 旧时模样 on 2019-12-18 13:37:17
Question: I have two cores in our Solr system (Solr version 3.6.1). When I invoke the following command line on our dedicated Solr server to add and then index a file:

java -Durl=http://solrprod:8080/solr/original/update -jar /home/solr/solr3/biomina/solr/post.jar /home/solr/tmp/2008/c2m-dump-01.noDEID_clean.xml

I get an exception in the /usr/share/tomcat7/logs/solr.2013-12-11.log file (after about 6 minutes of waiting):

SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:…
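This exception generally means another IndexWriter still holds the index's write lock, for example a crashed process that left a stale write.lock file behind, or two writers pointed at the same index directory. One common Solr 3.x mitigation, offered here as an editor-added suggestion to verify against your own solrconfig.xml, is to let Solr clear a stale lock at startup:

    <!-- solrconfig.xml (Solr 3.x), inside <mainIndex>: clear a stale write
         lock on startup. Only safe when no other process writes to this
         index directory. -->
    <mainIndex>
      <lockType>native</lockType>
      <unlockOnStartup>true</unlockOnStartup>
    </mainIndex>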

Sitecore - Indexing data from presentation components with non-context datasources

Submitted by 家住魔仙堡 on 2019-12-18 13:31:41
Question: I have a Sitecore site where many of the pages are assembled mainly from various sublayouts pointing at datasources elsewhere in the content tree. Here's a basic example of the problem: someone viewing a page about apples might see the word 'apple' ten times, yet Lucene will not index the Apple page item for that word, because the text is stored in other items. I'm sure this must be a common issue, but I can't seem to find any advice on it.

Answer 1: This is a common issue and there's a solution in…

Using RAMDirectory

Submitted by  ̄綄美尐妖づ on 2019-12-18 12:56:29

Question: When should I use Lucene's RAMDirectory? What are its advantages over other storage mechanisms? Finally, where can I find a simple code example?

Answer 1: Use it when you don't want to permanently store your index data. I use it for testing purposes: add data to your RAMDirectory and run your unit tests against it. For example:

public static void main(String[] args) {
    try {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer();
        IndexWriter writer = new IndexWriter(directory, analyzer, …
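The answer's snippet is cut off above; a complete version, filled in by the editor under older (Lucene 2.9/3.0-era) signatures where SimpleAnalyzer still had a no-argument constructor, might look like:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamDirectoryDemo {
        public static void main(String[] args) {
            try {
                // The index lives entirely in memory and vanishes when the JVM
                // exits, which is exactly what you want for unit tests.
                Directory directory = new RAMDirectory();
                Analyzer analyzer = new SimpleAnalyzer();
                IndexWriter writer = new IndexWriter(
                        directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

                Document doc = new Document();
                doc.add(new Field("content", "hello ram directory",
                        Field.Store.YES, Field.Index.ANALYZED));
                writer.addDocument(doc);
                writer.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }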