lucene

How to group results of a Lucene query, count the hits by group and highlight the documents in some selected group?

浪子不回头ぞ submitted on 2021-02-10 14:21:43
Question: I have different types of documents, each of which may have multiple authors. When searching, I would like the results to be grouped by author so that I can count the number of documents of each type by each author, and I would like to use the highlighter to highlight the documents belonging to a selected author. How should I index the documents and search them to achieve this? In particular, how do I perform grouping when a document has multiple authors and the documents are of different types?
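A minimal sketch of the counting the question asks for, done in memory over a list of hits. This is not the Lucene grouping module or highlighter API; the `Doc` class, its fields, and both method names are hypothetical and only illustrate the target result: a document with several authors is counted once under each author, and the selected author's documents are the ones that would be passed to a highlighter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration; none of these names are Lucene classes.
class AuthorGroupingSketch {

    // Assumed document shape: an id, a type, and one or more authors.
    static final class Doc {
        final String id;
        final String type;
        final List<String> authors;

        Doc(String id, String type, List<String> authors) {
            this.id = id;
            this.type = type;
            this.authors = authors;
        }
    }

    // Builds author -> (type -> number of matching documents).
    // A document with several authors is counted once under each of them.
    static Map<String, Map<String, Integer>> countByAuthorAndType(List<Doc> hits) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (Doc hit : hits) {
            for (String author : hit.authors) {
                counts.computeIfAbsent(author, a -> new TreeMap<>())
                      .merge(hit.type, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the ids of the hits that list the selected author,
    // i.e. the documents that would be handed to a highlighter.
    static List<String> idsForAuthor(List<Doc> hits, String author) {
        List<String> ids = new ArrayList<>();
        for (Doc hit : hits) {
            if (hit.authors.contains(author)) {
                ids.add(hit.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Doc> hits = List.of(
            new Doc("d1", "article", List.of("alice", "bob")),
            new Doc("d2", "book", List.of("alice")),
            new Doc("d3", "article", List.of("bob")));
        System.out.println(countByAuthorAndType(hits));
        System.out.println(idsForAuthor(hits, "alice"));
    }
}
```

In an index, the same shape would presumably come from storing one author value per author on each document, so that a multi-author document contributes to every one of its authors' groups.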

Lucene LongPoint Range search doesn't work

我的梦境 submitted on 2021-02-08 10:12:48
Question: I am using Lucene 8.2.0 with Java 11. I am trying to index a Long value so that I can filter by it using a range query, for example: +my_range_field:[1 TO 200]. However, any variant of that, even my_range_field:[* TO *], returns 0 results in this minimal example. As soon as I remove the + to make it an OR, I get 2 results. So I think I must be making a mistake in how I index it, but I can't work out what it might be. From the LongPoint JavaDoc: An indexed long field for fast

Term-document matrix in Lucene

懵懂的女人 submitted on 2021-02-08 06:51:57
Question: I am trying to get a term-document matrix from Lucene. It seems that most of the SO questions are for outdated APIs with different classes. I tried combining insight from these two questions to get a term vector from every document: "Term Vector Frequency in Lucene 4.0" and "Is it possible to iterate through documents stored in Lucene Index?" The relevant code is below, but DocEnum is not recognized in the current API. How can I get a term vector or a count of all terms for every document? IndexReader reader =
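For reference, the data structure being asked for can be sketched in plain Java without any Lucene classes. This is only an illustration of the target shape (one row per term, one column per document, each cell a term frequency); the class name, the `build` method, and the lowercase split on non-word characters are all assumptions, and a real index would use whatever analyzer the documents were indexed with.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a term-document matrix; not Lucene API.
class TermDocumentMatrixSketch {

    // Returns term -> counts, where counts[i] is the number of times
    // the term occurs in docs.get(i). Terms are sorted alphabetically.
    static Map<String, int[]> build(List<String> docs) {
        Map<String, int[]> matrix = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Naive tokenization (an assumption): lowercase, split on non-word characters.
            String[] tokens = docs.get(docId).toLowerCase(Locale.ROOT).split("\\W+");
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue; // split can yield an empty leading token
                }
                // Create the row on first sight of the term, then bump this document's cell.
                matrix.computeIfAbsent(token, t -> new int[docs.size()])[docId]++;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("lucene indexes text", "Lucene searches text text");
        build(docs).forEach((term, counts) ->
            System.out.println(term + " " + java.util.Arrays.toString(counts)));
    }
}
```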

Solr 8.6.3 could not index html file

醉酒当歌 submitted on 2021-02-08 06:41:31
Question:
solr/
├── bin/
├── CHANGES.TXT
├── contrib/
├── dist/
├── docs/
├── example/
├── licenses
............
├── server/
└── tempfolder/
    └── index.html
I have the folder structure above, and my Solr version is 8.6.3. When I enter the command: bin/post -c solrhelp -filetypes html tempfolder/ I get the following error: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/solrhelp/update/extract?resource.name=/home/user/solr-8.6.3/example/my-examples/index.html&literal.id=/home/user/solr

Looking for libraries which support deduplication on entity

主宰稳场 submitted on 2021-02-07 23:01:44
Question: I am going to work on some projects that deal with entity deduplication. I have one or more datasets that may contain duplicate entities. In real data, an entity may carry a name, address, country, email, or social media id in different forms. My goal is to identify possible duplicates based on different weights for the different pieces of entity information. I am looking for a library that is open source and preferably written in Java. As I need to process millions of records, I need to

Prevent “Too Many Clauses” on lucene query

假如想象 submitted on 2021-02-07 12:14:44
Question: In my tests I suddenly ran into a Too Many Clauses exception when trying to get the hits from a boolean query that consisted of a term query and a wildcard query. I searched around the net, and the resources I found suggest increasing BooleanQuery.SetMaxClauseCount(). This sounds fishy to me. What should I raise it to? How can I rely on this new magic number being sufficient for my query? How far can I increase this number before all hell breaks loose? In general I feel this is
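A simplified simulation of why a fixed limit is hard to pick, under one assumption: the wildcard query is rewritten into a boolean query with one clause per indexed term that matches the pattern, which as far as I understand is how this exception arises in the Lucene versions that behave this way. Nothing below is Lucene API. The class, the method, and the use of `IllegalStateException` are hypothetical stand-ins. The point it shows is that the clause count depends on the terms in the index, not on the query text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation; none of these names are Lucene classes.
class ClauseExpansionSketch {

    // Expands a trailing-star pattern such as "app*" into the indexed terms it matches.
    // Each returned term stands for one clause of the rewritten boolean query.
    // Throws once the expansion would exceed maxClauses (stand-in for Too Many Clauses).
    static List<String> expandPrefixWildcard(List<String> indexedTerms, String pattern, int maxClauses) {
        if (!pattern.endsWith("*")) {
            throw new IllegalArgumentException("this sketch only handles patterns ending in *");
        }
        String prefix = pattern.substring(0, pattern.length() - 1);
        List<String> clauses = new ArrayList<>();
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                if (clauses.size() == maxClauses) {
                    throw new IllegalStateException("too many clauses: limit is " + maxClauses);
                }
                clauses.add(term);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("apple", "apply", "apricot", "banana");
        // Same index, different patterns: the number of clauses varies with the data.
        System.out.println(expandPrefixWildcard(terms, "app*", 10));
        System.out.println(expandPrefixWildcard(terms, "a*", 10));
    }
}
```

Under that assumption, any chosen limit only holds until the index gains enough terms matching the pattern, so a narrower pattern bounds the expansion more reliably than a larger number.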