lucene

Apache Lucene doesn't filter stop words despite the usage of StopAnalyzer and StopFilter

Submitted by 断了今生、忘了曾经 on 2019-12-21 21:34:10
Question: I have a module based on Apache Lucene 5.5/6.0 which retrieves keywords. Everything works fine except one thing: Lucene doesn't filter stop words. I tried to enable stop-word filtering with two different approaches. Approach #1: tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet()); tokenStream.reset(); Approach #2: tokenStream = new StopFilter(new ClassicFilter(new LowerCaseFilter(stdToken)), …
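For readers unfamiliar with what a correctly wired StopFilter should produce, here is a library-free sketch of the intended behavior: lowercase each token and drop any that appear in the stop set. The stop set below is a small hypothetical subset standing in for EnglishAnalyzer.getDefaultStopSet().

```java
import java.util.*;
import java.util.stream.*;

public class StopWordSketch {
    // Hypothetical subset standing in for EnglishAnalyzer.getDefaultStopSet()
    static final Set<String> STOP = Set.of("the", "a", "an", "of", "and", "is");

    // Lowercase, tokenize on whitespace, and drop stop words,
    // the pipeline the StopFilter chain above is meant to implement.
    static List<String> filter(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                     .filter(t -> !STOP.contains(t))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filter("The quick fox and the hound"));
        // prints [quick, fox, hound]
    }
}
```

If the real Lucene chain emits tokens such as "the" anyway, the usual suspects are a missing tokenStream.reset() before consumption or a stop set whose case does not match the lowercased tokens.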

Sort by date in Solr/Lucene performance problems

Submitted by 我的梦境 on 2019-12-21 20:30:55
Question: We have set up a Solr index containing 36 million documents (~1K-2K each), and we query for at most 100 documents matching a single simple keyword. This works as fast as we had hoped. However, if we now add "&sort=createDate+desc" to the query (thus asking for the top 100 'newest' documents matching the query), it runs for a very, very long time and finally ends in an OutOfMemoryException. From what I've understood from the manual, this is caused by the fact that Lucene needs to …
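The expensive part of such a sorted query is not the top-100 collection itself; collecting the N best hits only needs a bounded heap, as this stdlib sketch (with hypothetical timestamp values) illustrates. The memory pressure typically comes from loading the sort field's value for every document in the index, which is why enabling docValues on the sort field is the usual remedy.

```java
import java.util.*;

public class TopKByDate {
    // Keep only the k largest timestamps seen so far in a min-heap of
    // size k; this is the same bounded-memory idea Lucene uses to
    // collect the top-N sorted hits.
    static long[] topK(long[] createDates, int k) {
        PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap
        for (long d : createDates) {
            heap.offer(d);
            if (heap.size() > k) heap.poll(); // evict the smallest
        }
        long[] out = new long[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) out[i] = heap.poll();
        return out; // descending, i.e. newest first
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(topK(new long[]{3, 9, 1, 7, 5}, 3)));
        // prints [9, 7, 5]
    }
}
```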

Use existing analyzer in hibernate search AnalyzerDiscriminator

Submitted by 十年热恋 on 2019-12-21 18:34:02
Question: @Entity @Indexed @AnalyzerDefs({ @AnalyzerDef(name = "en", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = { @TokenFilterDef(factory = LowerCaseFilterFactory.class), @TokenFilterDef(factory = EnglishPorterFilterFactory.class) }), @AnalyzerDef(name = "de", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = { @TokenFilterDef(factory = LowerCaseFilterFactory.class), @TokenFilterDef(factory = GermanStemFilterFactory.class) }) }) public …

how to add custom stop words using lucene in java

Submitted by 孤街浪徒 on 2019-12-21 17:29:18
Question: I am using Lucene to remove English stop words, but my requirement is to remove both English stop words and custom stop words. Below is my code to remove English stop words using Lucene. My sample code: public class Stopwords_remove { public String removeStopWords(String string) throws IOException { StandardAnalyzer ana = new StandardAnalyzer(Version.LUCENE_30); TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string)); StringBuilder sb = new StringBuilder(); …
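The usual approach is to merge the default English stop set with the custom words into a single set and hand that combined set to the stop filter. A library-free sketch of the merge (the English set here is a small hypothetical subset, not Lucene's real list):

```java
import java.util.*;
import java.util.stream.*;

public class CustomStopWords {
    // Hypothetical subset standing in for Lucene's default English stop set.
    static final Set<String> ENGLISH = Set.of("the", "a", "an", "and", "is");

    // Merge default and custom stop words into the one set a
    // StopFilter would be constructed with, then filter tokens.
    static List<String> removeStopWords(String text, Set<String> custom) {
        Set<String> all = new HashSet<>(ENGLISH);
        all.addAll(custom);
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                     .filter(t -> !all.contains(t))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("The foo and the bar is here",
                                           Set.of("foo", "here")));
        // prints [bar]
    }
}
```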

Full-text search for local/offline web “site” [duplicate]

Submitted by 怎甘沉沦 on 2019-12-21 17:28:37
Question: Possible duplicate: Full-text search for static HTML files on CD-Rom via javascript. I'm starting development of an application that creates a bunch of HTML files locally that can then be browsed in whatever web browser is on the system (including mobile devices) to which they're copied. The HTML files have many interactive features, so it's essentially an offline web app. My question is: what is the best way to implement full-text …
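Whatever the delivery mechanism, offline full-text search generally reduces to building an inverted index at generation time and shipping it alongside the HTML files. A minimal in-memory sketch of that structure (the example documents are hypothetical):

```java
import java.util.*;

public class TinyInvertedIndex {
    // Minimal inverted index: token -> sorted set of document ids.
    // A static-site search tool would build this at generation time
    // and serialize it to a file the client-side search code loads.
    final Map<String, Set<Integer>> index = new HashMap<>();

    void add(int docId, String text) {
        for (String tok : text.toLowerCase().split("\\W+"))
            index.computeIfAbsent(tok, k -> new TreeSet<>()).add(docId);
    }

    Set<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.add(1, "Offline web app with HTML files");
        idx.add(2, "Full-text search for static HTML");
        System.out.println(idx.search("html")); // prints [1, 2]
    }
}
```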

Solr - highlight query phrase

Submitted by 二次信任 on 2019-12-21 17:01:39
Question: Is it possible to highlight whole query phrases? E.g. when I ask for "United States" I want to get <em>United States</em> and not <em>United</em> <em>States</em>. I've searched the whole Internet for an answer, used all combinations of the hl.mergeContiguous, hl.usePhraseHighlighter and hl.highlightMultiTerm parameters, and still cannot make it work. My query is: http://localhost:8983/solandra/idxPosts.proj350_139/select?q=post_text:"Janusz Palikot"&hl=true&hl.fl=post_text&hl.mergeContiguous=true …
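The desired output, one <em> tag spanning the whole phrase rather than one per term, can be pictured with a plain regex-based sketch, which is roughly what hl.mergeContiguous=true aims to produce on the Solr side (the sentence below is hypothetical):

```java
import java.util.regex.*;

public class PhraseHighlighter {
    // Wrap every occurrence of the whole phrase in a single <em>...</em>
    // instead of tagging each term separately.
    static String highlight(String text, String phrase) {
        Pattern p = Pattern.compile(Pattern.quote(phrase),
                                    Pattern.CASE_INSENSITIVE);
        return p.matcher(text).replaceAll(m -> "<em>" + m.group() + "</em>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("He visited the United States twice.",
                                     "United States"));
        // prints He visited the <em>United States</em> twice.
    }
}
```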

Elasticsearch: filtering documents based on field length

Submitted by 左心房为你撑大大i on 2019-12-21 16:53:30
Question: I read a couple of similar problems on SO, and the suggested solutions don't work. I want to find all documents where the word field is shorter than 8 characters. My database screen: [screenshot not included] I tried to do this using this query: { "query": { "match_all": {} }, "filter": { "script": { "script": "doc['word'].length < 5" } } } What am I doing wrong? Am I missing something? Answer 1: Any field used in a script is loaded entirely into memory (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html#_document_fields), so …
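One likely pitfall, hedged since the intended Elasticsearch version isn't stated: in a script filter, doc['word'].length may refer to the number of values in the field rather than the character count, so a character-length check typically needs something like doc['word'].value.length(). The intended filter itself is simple, as this stdlib sketch with hypothetical values shows:

```java
import java.util.*;
import java.util.stream.*;

public class WordLengthFilter {
    // The intent of the script filter: keep only values whose
    // character length is below the given maximum.
    static List<String> shorterThan(List<String> words, int max) {
        return words.stream()
                    .filter(w -> w.length() < max)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(shorterThan(List.of("cat", "elephant", "horse"), 6));
        // prints [cat, horse]
    }
}
```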

Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights

Submitted by 一曲冷凌霜 on 2019-12-21 12:39:55
Question: When I use an analyzer with edge n-grams (min=3, max=7, front) plus term_vector=with_positions_offsets, with a document having text = "CouchDB", and I search for "couc", my highlight is on "cou" and not "couc". It seems my highlight covers only the minimum matching token "cou", while I would expect it to cover the exact token (if possible) or at least the longest token found. It works fine without analyzing the text with term_vector=with_positions_offsets. What's the impact of removing the term_vector=with …
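To see why the highlighter has several candidate tokens to choose from, here is a sketch of the front edge n-grams (min=3, max=7) the analyzer in the question would emit for a token: both "cou" and "couc" end up in the index at the same position, so a term-vector-based highlighter that picks the first matching gram lands on the shorter one.

```java
import java.util.*;

public class EdgeNgrams {
    // Front edge n-grams: all prefixes of the token whose length is
    // between min and max (capped at the token's own length).
    static List<String> edgeNgrams(String token, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int n = min; n <= Math.min(max, token.length()); n++)
            grams.add(token.substring(0, n));
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNgrams("couchdb", 3, 7));
        // prints [cou, couc, couch, couchd, couchdb]
    }
}
```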

Sentence aware search with Lucene SpanQueries

Submitted by 青春壹個敷衍的年華 on 2019-12-21 12:29:19
Question: Is it possible to use a Lucene SpanQuery to find all occurrences where the terms "red", "green" and "blue" all appear within a single sentence? My first (incomplete/incorrect) approach is to write an analyzer that places a special sentence-marker token at the beginning of a sentence, in the same position as the first word of the sentence, and to then query for something similar to the following: SpanQuery termsInSentence = new SpanNearQuery( SpanQuery[] { new SpanTermQuery( new Term (MY_SPECIAL …
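Setting the index machinery aside, the predicate being asked for can be stated in a few lines of plain Java: split the text into sentences and check whether any one sentence contains all the query terms. A SpanQuery-based solution is an indexed, position-aware version of this same check (the example text is hypothetical).

```java
import java.util.*;

public class SentenceSearch {
    // Library-free statement of the goal: does any single sentence
    // contain every one of the query terms?
    static boolean allTermsInOneSentence(String text, Set<String> terms) {
        for (String sentence : text.split("[.!?]")) {
            Set<String> tokens = new HashSet<>(
                Arrays.asList(sentence.toLowerCase().split("\\W+")));
            if (tokens.containsAll(terms)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "The flag is red, green and blue. The sky is blue.";
        System.out.println(
            allTermsInOneSentence(text, Set.of("red", "green", "blue")));
        // prints true
    }
}
```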

Solr doesn't overwrite - duplicated uniqueKey entries

Submitted by て烟熏妆下的殇ゞ on 2019-12-21 12:28:21
Question: I have a problem with Solr 5.3.1. My schema is rather simple: I have one uniqueKey, "id", as a string, which is indexed, stored, required, and non-multivalued. I first add documents with content_type:document_unfinished and then overwrite the same document, with the same id but content_type:document. The document then appears twice in the index. Again, the only uniqueKey is "id", as a string; the id originally comes from a MySQL primary-key int. It also looks like this happens not …
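For contrast, here is the last-write-wins behavior a uniqueKey is expected to give: re-adding a document with an existing id replaces the earlier version rather than duplicating it, exactly like a map keyed by id (the id and content types below mirror the question's hypothetical example).

```java
import java.util.*;

public class UniqueKeyOverwrite {
    // A map keyed by the unique id: adding the same id again
    // overwrites the previous entry instead of creating a duplicate,
    // which is the behavior expected from Solr's uniqueKey.
    static final Map<String, String> index = new LinkedHashMap<>();

    static void add(String id, String contentType) {
        index.put(id, contentType); // same id -> overwrite
    }

    public static void main(String[] args) {
        add("42", "document_unfinished");
        add("42", "document");
        System.out.println(index.size() + " " + index.get("42"));
        // prints 1 document
    }
}
```

When Solr keeps both versions instead, the common causes are documents landing in different shards/cores or a schema reload that left the uniqueKey definition out of sync with the index.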