lucene

Handling + as a special character in Lucene search

拟墨画扇 submitted on 2020-01-12 18:35:49
Question: How do I make sure Lucene gives me back relevant search results when my input string contains terms like c++? Lucene seems to ignore the ++ characters. Code details: when I execute this line, I get a blank search query. queryField = multiFieldQueryParser.Parse(inpKeywords); keywordsQuery.Add(queryField, BooleanClause.Occur.SHOULD); And here is my custom analyzer: public class CustomAnalyzer : Analyzer { private static readonly WhitespaceAnalyzer whitespaceAnalyzer = new WhitespaceAnalyzer();
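A minimal sketch of one angle on the question above, written in Python rather than the C# of the question. Lucene's classic query syntax treats `+` as an operator, so input such as c++ generally has to be escaped before it is parsed, and the analyzer also has to keep the `+` characters inside the token. The helpers `escape_query` and `whitespace_tokens` are hypothetical stand-ins for illustration, not Lucene APIs, and the special-character list is an assumption based on the classic query parser syntax.

```python
# Characters treated as syntax by Lucene's classic query parser (assumed list).
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_query(text):
    """Prefix every special character with a backslash so it is read literally."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation such as '++' stays in the token."""
    return text.split()


if __name__ == "__main__":
    print(escape_query("c++"))
    print(whitespace_tokens("c++ developer"))
```

Escaping only helps if the field was indexed with an analyzer that also preserves `++`; otherwise the escaped term has nothing to match.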

Solr exact word result come first

强颜欢笑 submitted on 2020-01-12 10:49:20
Question: In Solr 5.0.0, I have a product_name field. When I search for one or more words, it returns results whose product names contain those words. How can I make the exact matches come first? My schema.xml is below: <field name="product_name" type="text_wslc" indexed="true" stored="true" required="true" multiValued="false"/> and my field type definition is given below: <fieldType name="text_wslc" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer

Where can I find performance benchmarks for Apache Lucene/Solr

余生颓废 submitted on 2020-01-12 06:51:05
Question: Are there any links or resources for performance benchmarks of Lucene/Solr on large datasets, in the range of 500 GB to 5 TB? Thanks. Answer 1: Lucene committer Mike McCandless runs benchmarks on a regular basis to track down performance improvements and regressions. They are run on Wikipedia exports, which might be a little smaller than what you are looking for. But performance depends less on the input size than on the number of documents and unique terms.

Index a MySQL database with Apache Lucene, and keep them synchronized

给你一囗甜甜゛ submitted on 2020-01-11 17:24:10
Question: When a new item is added in MySQL, it must also be indexed by Lucene. When an existing item is removed from MySQL, it must also be removed from Lucene's index. The idea is to write a script that is called every x minutes by a scheduler (e.g. a cron task). This is a way to keep MySQL and Lucene synchronized. What I have managed so far: for each newly added item in MySQL, Lucene indexes it too. For each item already present in MySQL, Lucene does not reindex it (no duplicated items). This is the
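The synchronization idea in the question can be sketched as a set difference between the IDs in the database and the IDs already in the index. This is only an illustrative Python sketch: `plan_sync` is a hypothetical helper, and it assumes each MySQL row has a primary key that is also stored as a field in the Lucene index. Fetching the IDs from MySQL and from the index is not shown.

```python
def plan_sync(db_ids, indexed_ids):
    """Return (ids_to_index, ids_to_remove) for one scheduler run.

    db_ids:      primary keys currently present in the database table
    indexed_ids: primary keys currently stored in the search index
    """
    db_ids = set(db_ids)
    indexed_ids = set(indexed_ids)
    to_index = sorted(db_ids - indexed_ids)   # new rows, not yet indexed
    to_remove = sorted(indexed_ids - db_ids)  # rows deleted from the database
    return to_index, to_remove


if __name__ == "__main__":
    to_index, to_remove = plan_sync([1, 2, 3, 5], [2, 3, 4])
    print("index:", to_index)
    print("remove:", to_remove)
```

Rows that were updated in place are not detected by this comparison; handling them would need something extra, such as a last-modified timestamp.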

Implementing a simple search with Lucene + Spring Boot

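做~自己de王妃 submitted on 2020-01-11 15:25:02
1. Background: the website needs a search feature, but MySQL's LIKE can no longer meet the requirements, and something closer to full-text search is needed. Having briefly worked with Elasticsearch before, I felt that an Elasticsearch-style search would meet the need, and in the end decided to integrate Lucene to implement the search. (Elasticsearch could be used directly; I won't go into why it wasn't.) 2. Environment: Java 8, Spring Boot 2.2, Maven, Lucene 7.6. 3. Add the dependencies to the pom file: <!-- Lucene --> <!-- core package --> <dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-core</artifactId> <version>7.6.0</version> </dependency> <!-- query parsing for tokenized indexes --> <dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-queryparser</artifactId> <version>7.6.0</version> </dependency> <!-- general-purpose analyzers, suitable for English text --> <dependency> <groupId>org.apache.lucene</groupId> <artifactId>lucene-analyzers-common<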

document length in lucene 4.0

空扰寡人 submitted on 2020-01-11 13:33:11
Question: As I've read in the Lucene 4.0 documentation, the library now stores some statistics in order to compute different scoring models, one of them BM25. Is there a way, besides fetching a document, to fetch its length too? Answer 1: You can store whatever you want from FieldInvertState into the 'norm', and it doesn't have to be an 8-bit float either. The default is a lossy storage of the length; if you want the actual exact length, you could choose to use a short (16 bits) per document or

Get search word hits (number of occurrences) per document in Lucene

ⅰ亾dé卋堺 submitted on 2020-01-11 11:48:25
Question: Can anyone suggest the best way to get the hits (number of occurrences) of a word per document in Lucene? Answer 1: Lucene uses a field-based, rather than document-based, index. In order to get term counts per document: (1) Iterate over documents using IndexReader.document() and isDeleted(). (2) In document d, iterate over fields using Document.getFields(). (3) For each field f, get terms using getTermFreqVector(). (4) Go over the term vector and sum the frequencies per term. The sum of term frequencies per field
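The per-document counting described in the answer can be sketched in plain Python, without Lucene. In this illustration each document is a dict mapping field names to text, which stands in for the per-field term vectors. `term_counts` is a hypothetical helper, and the lowercase, whitespace-based tokenization is an assumption rather than what any particular Lucene analyzer does.

```python
from collections import Counter


def term_counts(document):
    """Sum term frequencies over all fields of one document."""
    totals = Counter()
    for field_text in document.values():
        # Stand-in for a per-field term vector: lowercase, split on whitespace.
        totals.update(field_text.lower().split())
    return totals


docs = {
    "d1": {"title": "Lucene search", "body": "search with lucene and more search"},
    "d2": {"title": "Solr", "body": "solr builds on lucene"},
}

# Occurrences of the word "search" per document, summed across fields.
hits = {doc_id: term_counts(doc)["search"] for doc_id, doc in docs.items()}

if __name__ == "__main__":
    print(hits)
```

A `Counter` returns 0 for a term that never occurs, so documents without the word still get an entry.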

Solr - most frequent searched words

℡╲_俬逩灬. submitted on 2020-01-11 09:18:05
Question: I'm trying to set up a Solr search engine. I've already set up spelling correction and suggestions. However, I can't seem to find how to retrieve the top 10 most searched words/terms/keywords in Solr/Lucene. How can I get this? I want to display them on my homepage. Answer 1: Solr does not provide this kind of feature out of the box. There is the StatsComponent, which provides all kinds of statistics, but all of them are numeric only. Depending on how you access Solr (directly or
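Since the answer is cut off, the following is only one possible approach and not necessarily the one the answer goes on to suggest: record each user query in the application layer and count the terms yourself. In this Python sketch, `top_terms` is a hypothetical helper, the list of queries is assumed to come from your own query log, and terms are split on whitespace after lowercasing.

```python
from collections import Counter


def top_terms(queries, n=10):
    """Return the n most frequent search terms as (term, count) pairs."""
    counts = Counter()
    for query in queries:
        counts.update(query.lower().split())
    return counts.most_common(n)


if __name__ == "__main__":
    logged_queries = ["Lucene search", "solr search", "search"]
    print(top_terms(logged_queries, n=1))
```

In practice the log would probably live in a database or log file rather than in memory, and the counts would be refreshed periodically.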

solr case insensitive sort not working

偶尔善良 submitted on 2020-01-11 06:44:06
Question: I have these fields in my Solr schema.xml: <field name="short_name" type="text_general" indexed="true" stored="true" required="false" /> <field name="short_name_copy" type="string_ci" indexed="true" stored="true" required="false" /> <copyField source="short_name" dest="short_name_copy"/> and this field type: <fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true"> <analyzer type="query"> <tokenizer class="solr.KeywordTokenizerFactory"/> <filter class="solr

Design and implementation of a Lucene-based search engine

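梦想与她 submitted on 2020-01-11 01:18:09
We live in an era of big data. As the volume of online information grows, people care more and more about how to quickly and effectively retrieve the information they need, including latent and valuable information, from the mass of data on the network, so that it can be put to use in daily work and life. Because search engine technology solves the problem of searching large amounts of online information well, both the fast-growing computer industry and the up-and-coming information industry now treat Web search engine technology as a direction they compete to explore and study. A search engine is a computer system that, following a given strategy and method, uses computer programs to find information on the Internet, displays it, and then organizes and filters what it has found in order to provide users with an information retrieval service; its ultimate goal is to give users material relevant to what they searched for. There are many kinds of search engines: full-text index engines and directory index engines, as well as aggregating search engines, vertical search engines, and meta search engines. In addition, there are portal search engines, free-for-all link lists, and so on. Based on a study of search engines combined with Lucene's own characteristics, the functions this design needs to implement are as follows: support desktop file search for the txt, doc, xls, and ppt formats; support tokenized (word-segmented) queries; support full-text search; highlight the search keywords; display the time taken by a query; display the search history and filter keywords. The tokenized query and full-text search functions can both be implemented with Lucene's built-in libraries together with the relevant algorithms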