lucene

How to index arrays (tags) in CouchDB using couchdb-lucene

Submitted by 喜欢而已 on 2019-12-21 02:45:10
Question: The setup: I have a project that is using CouchDB. The documents will have a field called "tags". This "tags" field is an array of strings (e.g., "tags":["tag1","tag2","etc"]). I am using couchdb-lucene as my search provider. The question: what function can be used to get couchdb-lucene to index the elements of "tags"? If you have an idea but no test environment, type it out and I'll try it and post the result here. Answer 1: Well, it was quite easy after I figured it out. Please realize that the $
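For reference, a minimal sketch of what a couchdb-lucene fulltext design document for this could look like, assuming a couchdb-lucene-style JavaScript index function; the design-document id (_design/search) and view name (by_tag) are illustrative, not from the original answer:

{
  "_id": "_design/search",
  "fulltext": {
    "by_tag": {
      "index": "function(doc) { var ret = new Document(); if (doc.tags) { for (var i = 0; i < doc.tags.length; i++) { ret.add(doc.tags[i], {\"field\": \"tags\"}); } } return ret; }"
    }
  }
}

Each array element is added as a separate value of the tags field, so a query such as q=tags:tag1 against this view should match documents whose tags array contains tag1.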

Lucene index not getting synced when an update occurs in the DB through Hibernate

Submitted by 爷,独闯天下 on 2019-12-21 02:14:10
Question: I am working on a proof of concept with Hibernate Search (based on Lucene) in the following environment: hibernate-search-engine-4.4.2.Final.jar; lucene-core-3.6.2.jar; MySQL 5.5. I use the @Indexed annotation on the domain class, @Field(index=Index.YES, analyze=Analyze.YES, store=Store.NO) on fields, and @IndexedEmbedded on a collection of instances of a different domain class. I did explicit indexing ONLY at application startup (as the Hibernate Search API documentation states that Hibernate Search will transparently index
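A likely cause worth checking: Hibernate Search's event listeners only see changes made through Hibernate ORM inside a transaction; rows modified directly in MySQL bypass the listeners and never reach the Lucene index, so they require an explicit reindex. A minimal sketch against the Hibernate Search 4.x API, where Book is a hypothetical @Indexed entity:

import org.hibernate.Session;
import org.hibernate.Transaction;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;

public class SyncSketch {
    // Changes made through a Hibernate transaction are indexed on commit.
    static void updateThroughHibernate(Session session, Long id) {
        Transaction tx = session.beginTransaction();
        Book book = (Book) session.get(Book.class, id); // Book: hypothetical @Indexed entity
        book.setTitle("updated title");
        tx.commit(); // Hibernate Search applies the Lucene index update here
    }

    // Changes made outside Hibernate (e.g., raw SQL against MySQL) are invisible
    // to the listeners; rebuild the index for them with the MassIndexer.
    static void reindexAll(Session session) throws InterruptedException {
        FullTextSession fts = Search.getFullTextSession(session);
        fts.createIndexer(Book.class).startAndWait();
    }
}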

Lucene IndexWriter slow to add documents

Submitted by 荒凉一梦 on 2019-12-21 01:58:31
Question: I wrote a small loop which added 10,000 documents to the IndexWriter and it took forever. Is there another way to index large volumes of documents? I ask because when this goes live it has to load in 15,000 records. The other question is: how do I prevent having to reload all the records when the web application is restarted? Edit: Here is the code I used: for (int t = 0; t < 10000; t++){ doc = new Document(); text = "Value" + t.toString(); doc.Add(new Field("Value", text,
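A hedged sketch of the usual fixes, written against the Lucene 3.6-era Java API (the question's snippet looks like Lucene.NET, but the idea is the same): keep a single IndexWriter, let the RAM buffer batch flushes, and commit once at the end. Writing the index to an FSDirectory on disk also answers the restart question, since the index persists and only needs to be opened, not rebuilt. The path and field name are illustrative:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BulkIndexSketch {
    public static void main(String[] args) throws Exception {
        // On-disk index: survives application restarts.
        Directory dir = FSDirectory.open(new File("/tmp/bulk-index"));
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36,
                new StandardAnalyzer(Version.LUCENE_36));
        cfg.setRAMBufferSizeMB(64.0); // flush by RAM usage, not per document
        IndexWriter writer = new IndexWriter(dir, cfg);
        for (int t = 0; t < 10000; t++) {
            Document doc = new Document();
            doc.add(new Field("Value", "Value" + t, Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc); // no commit or optimize inside the loop
        }
        writer.commit(); // one commit at the end
        writer.close();
    }
}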

New Solr features [4.x, 5.x, 6.x, 7.x]

Submitted by ぃ、小莉子 on 2019-12-21 01:49:50
I. New features in Solr 4.x
1. Near real-time search: Solr's Near Real-Time (NRT) search makes newly added documents searchable almost immediately, so searches can keep up with rapidly changing data.
2. Atomic updates and optimistic concurrency: atomic updates let a client application add, update, delete, and increment fields of an existing document without resending the entire document. When two requests try to change the same document at once, Solr uses an optimistic mechanism to prevent incompatible updates. In short, Solr uses a special _version_ field to guarantee safe update semantics: both requests start from the same pre-update version, the request that commits first succeeds and bumps the version number, and the later request fails because its version is now stale (a request is executed only if its version matches the current one).
3. Real-time GET: whether or not a document has been committed to the index, real-time GET can retrieve the latest version of it by its unique identifier (the transaction log makes this possible). This is similar to how a key-value store such as Cassandra retrieves data by row key. Before Solr 4, a document could not be retrieved until it had been committed to the Lucene index, and commits are time-consuming and hurt query performance.
4. Write durability through the transaction log: when a document is sent to Solr for indexing, it is written to a transaction log so that a server failure does not cause data loss. Solr's transaction log sits between the client application and the Lucene index
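A minimal SolrJ sketch of an atomic update guarded by optimistic concurrency, assuming a Solr 4.x core at a local URL; the document id, field name, and version value are illustrative:

import java.util.Collections;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc1");
        // Atomic update: "set" replaces the field's value without resending the whole document.
        doc.addField("title", Collections.singletonMap("set", "new title"));
        // Optimistic concurrency: supply the _version_ we last read; Solr rejects
        // the update (HTTP 409) if the document has been changed since.
        doc.addField("_version_", 1234567890L); // hypothetical version value
        server.add(doc);
        server.commit();
    }
}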

How to instruct StandardAnalyzer in Lucene not to remove stop words?

Submitted by 半世苍凉 on 2019-12-21 00:14:09
Question: Simple question: how can I make Lucene's StandardAnalyzer not remove stop words when analyzing my sentence? Answer 1: The answer is version-dependent. For Lucene 3.0.3 (current at the time), you need to construct the StandardAnalyzer with an empty set of stop words, using something like this: Analyzer ana = new StandardAnalyzer(LUCENE_30, Collections.emptySet()); Answer 2: Update: the answer is version-dependent. For Lucene 4.0, use: Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40, CharArraySet.EMPTY
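A quick way to verify the behavior is to run the analyzer over a sentence and print the tokens. A minimal sketch against the Lucene 3.1+ attribute API (on 3.0 the equivalent attribute is TermAttribute); the field name "f" is arbitrary:

import java.io.StringReader;
import java.util.Collections;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StopWordCheck {
    public static void main(String[] args) throws Exception {
        // Empty stop-word set: StandardAnalyzer keeps words like "to" and "be".
        Analyzer ana = new StandardAnalyzer(Version.LUCENE_31, Collections.<String>emptySet());
        TokenStream ts = ana.tokenStream("f", new StringReader("to be or not to be"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // prints every word, stop words included
        }
        ts.close();
    }
}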

How to repair a corrupted Lucene index?

Submitted by 匆匆过客 on 2019-12-20 18:33:05
Question: My server lost power and the Lucene index was corrupted. I ran CheckIndex but it failed: java -cp /home/dthoai/programs/paesia/checker/lucene-core-3.5.0.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /mnt/peda/paesia/index -fix Opening index @ /mnt/peda/paesia/index ERROR: could not read any segments file in directory java.io.IOException: read past EOF: MMapIndexInput(path="/mnt/peda/paesia/index/segments_ls0l") at org.apache.lucene.store.MMapDirectory$MMapIndexInput
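For reference, a hedged sketch of running CheckIndex programmatically on Lucene 3.5 (the index path is taken from the question). Note that -fix can only drop corrupt segments, losing the documents they held, and it cannot help at all when no segments_N file is readable, which is what the EOF error above indicates:

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RepairSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/mnt/peda/paesia/index"));
        CheckIndex checker = new CheckIndex(dir);
        CheckIndex.Status status = checker.checkIndex();
        if (!status.clean) {
            // Drops unrecoverable segments; the documents they held are lost.
            checker.fixIndex(status);
        }
        dir.close();
    }
}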

Solr/Solrj: How can I determine the total number of documents in an index?

Submitted by 折月煮酒 on 2019-12-20 16:12:11
Question: How can I determine the total number of documents in a Solr index using Solrj? After hours of searching on my own, I actually have an answer (given below); I'm only posting this question so others can find the solution more easily. Answer 1: Here's what I'm using. Is this canonical? Is there a better way? SolrQuery q = new SolrQuery("*:*"); q.setRows(0); // don't actually request any data return server.query(q).getResults().getNumFound(); Answer 2: Your answer of sending the query *:* is probably the

How to search special characters (+ ! \ ? :) in Lucene

Submitted by 不问归期 on 2019-12-20 12:25:31
Question: I want to search for special characters in my index. I escaped all the special characters in the query string, but when I run a query such as + against the Lucene index, it creates the query +(). Hence it searches no fields. How can I solve this problem? My index contains these special characters. Answer 1: If you are using the StandardAnalyzer, it will discard non-alphanumeric characters. Try indexing the same value with a WhitespaceAnalyzer and see if that preserves the characters you need. It might also keep stuff you
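A minimal sketch of the two halves of the fix, assuming Lucene 3.x and an illustrative field name "content": analyze with WhitespaceAnalyzer so the characters survive, and backslash-escape the query string with QueryParser.escape:

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class SpecialCharQuery {
    public static void main(String[] args) throws Exception {
        // The analyzer must match the one used at index time; WhitespaceAnalyzer
        // keeps characters like + ! \ ? : that StandardAnalyzer would discard.
        QueryParser parser = new QueryParser(Version.LUCENE_36, "content",
                new WhitespaceAnalyzer(Version.LUCENE_36));
        // escape() backslash-escapes all Lucene query-syntax characters, so
        // "+" is treated as a literal term instead of the MUST operator.
        String escaped = QueryParser.escape("C++ rocks?");
        Query q = parser.parse(escaped);
        System.out.println(q); // prints the parsed query for inspection
    }
}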

Is there an HTML analyzer/tokenizer for Lucene?

Submitted by Deadly on 2019-12-20 10:57:23
Question: I want to index text from HTML in Lucene. What is the best way to achieve this? Is there a good contrib module that can do this in Lucene? EDIT: I finally ended up using the Jericho parser. It doesn't create a DOM and is easy to use. Answer 1: I'm assuming that you don't actually want to index the HTML tags. If that's the case, you can first extract the text from the HTML using Apache Tika, then index that text in Lucene. Answer 2: I would recommend using the Jsoup HTML parser to extract the text and then use
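A minimal sketch of the Jsoup route from Answer 2, assuming Lucene 3.x and jsoup on the classpath; the index path, field name, and sample HTML are illustrative:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.jsoup.Jsoup;

public class HtmlIndexSketch {
    public static void main(String[] args) throws Exception {
        // Strip markup first; Jsoup's text() returns only the visible text.
        String html = "<html><body><h1>Title</h1><p>Hello, Lucene.</p></body></html>";
        String text = Jsoup.parse(html).text(); // "Title Hello, Lucene."

        Directory dir = FSDirectory.open(new File("/tmp/html-index"));
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_36,
                new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, cfg);
        Document doc = new Document();
        doc.add(new Field("content", text, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
    }
}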