Lucene

Getting the Doc ID in Lucene

Submitted by 房东的猫 on 2019-12-31 01:56:06
Question: In Lucene, I can do the following: doc.GetField("mycustomfield").StringValue(); This retrieves the value of a field in an index's document. My question: for the same doc, is there a way to get the doc ID? Luke displays it, so there must be a way to figure it out. I need it to delete documents on updates. I scoured the docs but have not found the term to use in GetField, or whether there is another method. Answer 1: Turns out you have to do this: var hits = searcher.Search(query); var …
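The answer is cut off above, but the pattern it starts is the Lucene 2.x Hits API, where the internal document ID comes from the search results rather than from the Document itself. A minimal sketch in Java (the question uses Lucene.Net, whose API mirrors this; the field name is the question's own):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class DocIdExample {
        // The internal doc ID is exposed by the Hits object, not by Document.
        static void showDocIds(IndexSearcher searcher, Query query) throws Exception {
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                int docId = hits.id(i);       // internal document ID (what Luke displays)
                Document doc = hits.doc(i);   // the stored fields, as in the question
                System.out.println(docId + " -> " + doc.get("mycustomfield"));
            }
        }
    }

Note that internal IDs are not stable across segment merges, so for delete-on-update the usual approach is to index a unique key field and call IndexWriter.updateDocument(new Term("key", value), doc) rather than deleting by ID.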

Elasticsearch query time boosting produces result in inadequate order

Submitted by 混江龙づ霸主 on 2019-12-30 14:43:28
Question: The ES search results for the search keywords one two three seem to come back in the wrong order after applying a per-keyword boost. Please help me modify my "faulty" query so that it produces the "expected result" I describe below. I'm on ES 1.7.4 with Lucene 4.10.4. Boosting criteria (three is regarded as the most important keyword):

    word    boost
    -----   -----
    one         1
    two         2
    three       3

ES index content (just showing a MySQL dump to keep the post short): mysql> SELECT id, title FROM post; +----+--------- …
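One common way to express this kind of per-keyword weighting on ES 1.x is a bool query with one should clause per keyword, each carrying its own boost. A sketch using the ES 1.7 Java API (the title field is an assumption from the dump above; this illustrates the pattern, not necessarily the poster's exact fix):

    import org.elasticsearch.index.query.BoolQueryBuilder;
    import org.elasticsearch.index.query.QueryBuilders;

    public class BoostedKeywords {
        // Each keyword becomes an optional clause with its own boost,
        // so matches on "three" are favored over "two" over "one".
        static BoolQueryBuilder build() {
            return QueryBuilders.boolQuery()
                    .should(QueryBuilders.matchQuery("title", "one").boost(1f))
                    .should(QueryBuilders.matchQuery("title", "two").boost(2f))
                    .should(QueryBuilders.matchQuery("title", "three").boost(3f));
        }
    }

Keep in mind that under Lucene 4.x scoring, boosts interact with TF/IDF, field-length norms, and the coord factor, so a boost of 3 is a relative preference rather than a guaranteed 3x multiplier; that interaction is often why boosted results come back in an unexpected order.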

Best approach for doing full-text search with list-of-integers documents

Submitted by 时光怂恿深爱的人放手 on 2019-12-30 13:31:57
Question: I'm working on a C++/Qt image retrieval system based on similarity that works as follows (I'll try to avoid irrelevant or off-topic details): I take a collection of images and build an index from them using OpenCV functions. After that, for each image, I get a list of integer values representing the important "classes" that each image belongs to. The more integers two images have in common, the more similar they are believed to be. So, when I want to query the system, I just have to compute the …
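Although the question is cut off, the stated similarity measure (count of shared integers) maps naturally onto a full-text engine: encode each class ID as a term, and let optional clauses score by overlap. A sketch against the older Lucene Java API (the field name classes is illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class ClassOverlapQuery {
        // One optional clause per class ID: images sharing more class IDs
        // with the query image match more clauses and therefore score higher.
        static BooleanQuery build(int[] classIds) {
            BooleanQuery query = new BooleanQuery();
            for (int id : classIds) {
                query.add(new TermQuery(new Term("classes", Integer.toString(id))),
                          BooleanClause.Occur.SHOULD);
            }
            return query;
        }
    }

At indexing time the same encoding applies: store each image's class list as a whitespace-separated string of IDs in the classes field.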

Why does Lucene cause OOM when indexing large files?

Submitted by 江枫思渺然 on 2019-12-30 09:42:32
Question: I'm working with Lucene 2.4.0 and the JVM (JDK 1.6.0_07). I'm consistently receiving OutOfMemoryError: Java heap space when trying to index large text files. Example 1: indexing a 5 MB text file runs out of memory with a 64 MB max heap size, so I increased the max heap size to 512 MB. This worked for the 5 MB text file, but Lucene still used 84 MB of heap space to do it. Why so much? The class FreqProxTermsWriterPerField appears to be the biggest memory consumer by far according to …
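FreqProxTermsWriterPerField is part of Lucene's in-memory postings buffer, which holds inverted terms until the writer flushes a segment to disk. A sketch of the usual knobs on a Lucene 2.4 IndexWriter (the values are illustrative, not a recommendation for this exact workload):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class BoundedIndexing {
        // Caps the RAM the indexing buffer may use before flushing a segment;
        // left uncapped, large inputs keep all their postings in memory at once.
        static IndexWriter open(File indexDir) throws Exception {
            IndexWriter writer = new IndexWriter(
                    FSDirectory.getDirectory(indexDir),
                    new StandardAnalyzer(),
                    true,
                    IndexWriter.MaxFieldLength.UNLIMITED);
            writer.setRAMBufferSizeMB(32.0);  // flush after ~32 MB of buffered postings
            return writer;
        }
    }

Note that setRAMBufferSizeMB bounds the buffer across documents, but the postings of a single huge document still have to fit in memory while it is being inverted; if that remains a problem, IndexWriter.setMaxFieldLength can cap the number of terms indexed per document, at the cost of truncation.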

StandardAnalyzer with stemming

Submitted by 半腔热情 on 2019-12-30 07:25:17
Question: Is there a way to integrate PorterStemFilter into StandardAnalyzer in Lucene, or do I have to copy/paste StandardAnalyzer's source code and add the filter, since StandardAnalyzer is defined as a final class? Is there any smarter way? Also, if I would like not to consider numbers, how can I achieve that? Thanks. Answer 1: If you want to use this combination for English text analysis, then you should use Lucene's EnglishAnalyzer. Otherwise, you could create a new Analyzer that extends the …
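The truncated answer is pointing at the standard workaround: there is no need to touch StandardAnalyzer's source, because a custom Analyzer can rebuild the same tokenizer/filter chain and append PorterStemFilter. A sketch against the Lucene 5.x/6.x API (package names and the createComponents signature shift between major versions):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class StemmingAnalyzer extends Analyzer {
        // Rebuilds StandardAnalyzer's pipeline and appends the stemmer;
        // extending Analyzer sidesteps StandardAnalyzer being final.
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream result = new LowerCaseFilter(source);
            result = new PorterStemFilter(result);
            return new TokenStreamComponents(source, result);
        }
    }

For dropping numbers, one option is to insert a TypeTokenFilter into the chain that rejects tokens StandardTokenizer tags with the <NUM> type, though the exact token types emitted depend on the tokenizer version.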

Search queries in neo4j: how to sort results in neo4j in START query with internal TFIDF / levenshtein or other algorithms?

Submitted by 北城余情 on 2019-12-30 05:32:05
Question: I am working on a model using Wikipedia topic names for my experiments in full-text indexing. I set up an index on 'topic' (legacy), and do a full-text search for 'united states': start n=node:topic('name:(united states)') return n The first results are not relevant at all: 'List of United States National Historic Landmarks in United States commonwealths and territories, associated states, and foreign states' [...] and the actual 'united states' entry is buried deep down the list. As such, it …
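Since a legacy index lookup hands that string to Lucene's query parser, plain Lucene syntax is available inside it. One sketch (not necessarily the fix the thread settled on) boosts an exact phrase match while keeping the loose term match as a fallback:

    START n = node:topic('name:"united states"^2 OR name:(united states)')
    RETURN n

The phrase and ^-boost syntax here is standard Lucene; the field name follows the question's own index. Whether this reorders results as desired still depends on how the legacy index scores matches.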

Timing out a query in Solr

Submitted by 帅比萌擦擦* on 2019-12-30 02:18:15
Question: I am hitting queries to Solr through a custom-developed layer, and some queries that I time out in my layer are still running in the Solr instance. Is there a parameter in Solr that can be used to time out a particular query? Answer 1: As stated in "Solr query continues after client disconnects?" and written in the Solr FAQ: internally, Solr does nothing to time out any requests -- it lets both updates and queries take however long they need to take to be processed fully. But at the same spot in the FAQ is …
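The FAQ passage the truncated answer is heading toward is presumably Solr's timeAllowed request parameter, which caps (in milliseconds) how long the main stages of a search may run and returns partial results rather than letting the query run on. A sketch (host, core, and query are illustrative):

    http://localhost:8983/solr/mycore/select?q=title:lucene&timeAllowed=1000

Responses that hit the limit are flagged with partialResults=true in the response header; timeAllowed bounds the query-processing phases, so other stages (e.g., writing out the response) can still run past it.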
