lucene

Apache Lucene doesn't filter stop words despite the usage of StopAnalyzer and StopFilter

Submitted by 断了今生、忘了曾经 on 2019-12-21 21:34:10
Question: I have a module based on Apache Lucene 5.5/6.0 which retrieves keywords. Everything works fine except one thing: Lucene doesn't filter stop words. I tried to enable stop-word filtering with two different approaches. Approach #1: tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet()); tokenStream.reset(); Approach #2: tokenStream = new StopFilter(new ClassicFilter(new LowerCaseFilter(stdToken)), …
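For readers unfamiliar with what a correctly wired StopFilter should produce, here is a library-free sketch of the intended behavior: lowercase each token and drop any that appear in the stop set. The stop set below is a small hypothetical subset standing in for EnglishAnalyzer.getDefaultStopSet().

```java
import java.util.*;
import java.util.stream.*;

public class StopWordSketch {
    // Hypothetical subset standing in for EnglishAnalyzer.getDefaultStopSet()
    static final Set<String> STOP = Set.of("the", "a", "an", "of", "and", "is");

    // Lowercase, tokenize on whitespace, and drop stop words,
    // the pipeline the StopFilter chain above is meant to implement.
    static List<String> filter(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                     .filter(t -> !STOP.contains(t))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filter("The quick fox and the hound"));
        // prints [quick, fox, hound]
    }
}
```

If the real Lucene chain emits tokens such as "the" anyway, the usual suspects are a missing tokenStream.reset() before consumption or a stop set whose case does not match the lowercased tokens.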

Sort by date in Solr/Lucene performance problems

Submitted by 我的梦境 on 2019-12-21 20:30:55
Question: We have set up a Solr index containing 36 million documents (~1K-2K each), and we query for at most 100 documents matching a single simple keyword. This works as fast as we had hoped. However, if we now add "&sort=createDate+desc" to the query (thus asking for the top 100 'newest' documents matching the query), it runs for a very, very long time and finally ends in an OutOfMemoryException. From what I've understood from the manual, this is caused by the fact that Lucene needs to …
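The expensive part of such a sorted query is not the top-100 collection itself; collecting the N best hits only needs a bounded heap, as this stdlib sketch (with hypothetical timestamp values) illustrates. The memory pressure typically comes from loading the sort field's value for every document in the index, which is why enabling docValues on the sort field is the usual remedy.

```java
import java.util.*;

public class TopKByDate {
    // Keep only the k largest timestamps seen so far in a min-heap of
    // size k; this is the same bounded-memory idea Lucene uses to
    // collect the top-N sorted hits.
    static long[] topK(long[] createDates, int k) {
        PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap
        for (long d : createDates) {
            heap.offer(d);
            if (heap.size() > k) heap.poll(); // evict the smallest
        }
        long[] out = new long[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) out[i] = heap.poll();
        return out; // descending, i.e. newest first
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(topK(new long[]{3, 9, 1, 7, 5}, 3)));
        // prints [9, 7, 5]
    }
}
```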

Use existing analyzer in hibernate search AnalyzerDiscriminator

Submitted by 十年热恋 on 2019-12-21 18:34:02
Question: @Entity @Indexed @AnalyzerDefs({ @AnalyzerDef(name = "en", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = { @TokenFilterDef(factory = LowerCaseFilterFactory.class), @TokenFilterDef(factory = EnglishPorterFilterFactory.class) }), @AnalyzerDef(name = "de", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = { @TokenFilterDef(factory = LowerCaseFilterFactory.class), @TokenFilterDef(factory = GermanStemFilterFactory.class) }) }) public …

how to add custom stop words using lucene in java

Submitted by 孤街浪徒 on 2019-12-21 17:29:18
Question: I am using Lucene to remove English stop words, but my requirement is to remove both English stop words and custom stop words. Below is my code to remove English stop words using Lucene. My sample code: public class Stopwords_remove { public String removeStopWords(String string) throws IOException { StandardAnalyzer ana = new StandardAnalyzer(Version.LUCENE_30); TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string)); StringBuilder sb = new StringBuilder(); …
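The usual approach is to merge the default English stop set with the custom words into a single set and hand that combined set to the stop filter. A library-free sketch of the merge (the English set here is a small hypothetical subset, not Lucene's real list):

```java
import java.util.*;
import java.util.stream.*;

public class CustomStopWords {
    // Hypothetical subset standing in for Lucene's default English stop set.
    static final Set<String> ENGLISH = Set.of("the", "a", "an", "and", "is");

    // Merge default and custom stop words into the one set a
    // StopFilter would be constructed with, then filter tokens.
    static List<String> removeStopWords(String text, Set<String> custom) {
        Set<String> all = new HashSet<>(ENGLISH);
        all.addAll(custom);
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                     .filter(t -> !all.contains(t))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("The foo and the bar is here",
                                           Set.of("foo", "here")));
        // prints [bar]
    }
}
```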

Full-text search for local/offline web “site” [duplicate]

Submitted by 怎甘沉沦 on 2019-12-21 17:28:37
Question: Possible duplicate: Full-text search for static HTML files on CD-Rom via javascript. I'm starting development of an application that creates a bunch of HTML files locally that can then be browsed in whatever web browser is on the system (including mobile devices) to which they're copied. The HTML files have many interactive features, so it's essentially an offline web app. My question is: what is the best way to implement full-text …
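Whatever the delivery mechanism, offline full-text search generally reduces to building an inverted index at generation time and shipping it alongside the HTML files. A minimal in-memory sketch of that structure (the example documents are hypothetical):

```java
import java.util.*;

public class TinyInvertedIndex {
    // Minimal inverted index: token -> sorted set of document ids.
    // A static-site search tool would build this at generation time
    // and serialize it to a file the client-side search code loads.
    final Map<String, Set<Integer>> index = new HashMap<>();

    void add(int docId, String text) {
        for (String tok : text.toLowerCase().split("\\W+"))
            index.computeIfAbsent(tok, k -> new TreeSet<>()).add(docId);
    }

    Set<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.add(1, "Offline web app with HTML files");
        idx.add(2, "Full-text search for static HTML");
        System.out.println(idx.search("html")); // prints [1, 2]
    }
}
```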

Solr - highlight query phrase

Submitted by 二次信任 on 2019-12-21 17:01:39
Question: Is it possible to highlight whole query phrases? E.g. when I ask for "United States" I want to get <em>United States</em> and not <em>United</em> <em>States</em>. I've searched the whole Internet for an answer, used all combinations of the hl.mergeContiguous, hl.usePhraseHighlighter and hl.highlightMultiTerm parameters, and still cannot make it work. My query is: http://localhost:8983/solandra/idxPosts.proj350_139/select?q=post_text:"Janusz Palikot"&hl=true&hl.fl=post_text&hl.mergeContiguous=true …
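The desired output, one <em> tag spanning the whole phrase rather than one per term, can be pictured with a plain regex-based sketch, which is roughly what hl.mergeContiguous=true aims to produce on the Solr side (the sentence below is hypothetical):

```java
import java.util.regex.*;

public class PhraseHighlighter {
    // Wrap every occurrence of the whole phrase in a single <em>...</em>
    // instead of tagging each term separately.
    static String highlight(String text, String phrase) {
        Pattern p = Pattern.compile(Pattern.quote(phrase),
                                    Pattern.CASE_INSENSITIVE);
        return p.matcher(text).replaceAll(m -> "<em>" + m.group() + "</em>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("He visited the United States twice.",
                                     "United States"));
        // prints He visited the <em>United States</em> twice.
    }
}
```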

Elasticsearch: filtering documents based on field length

Submitted by 左心房为你撑大大i on 2019-12-21 16:53:30
Question: I read a couple of similar problems on SO, and the suggested solutions don't work. I want to find all documents where the word field is shorter than 8 characters. My database screen: [screenshot not included] I tried to do this using this query: { "query": { "match_all": {} }, "filter": { "script": { "script": "doc['word'].length < 5" } } } What am I doing wrong? Am I missing something? Answer 1: Any field used in a script is loaded entirely into memory (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html#_document_fields), so …
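One likely pitfall, hedged since the intended Elasticsearch version isn't stated: in a script filter, doc['word'].length may refer to the number of values in the field rather than the character count, so a character-length check typically needs something like doc['word'].value.length(). The intended filter itself is simple, as this stdlib sketch with hypothetical values shows:

```java
import java.util.*;
import java.util.stream.*;

public class WordLengthFilter {
    // The intent of the script filter: keep only values whose
    // character length is below the given maximum.
    static List<String> shorterThan(List<String> words, int max) {
        return words.stream()
                    .filter(w -> w.length() < max)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(shorterThan(List.of("cat", "elephant", "horse"), 6));
        // prints [cat, horse]
    }
}
```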

Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights

Submitted by 一曲冷凌霜 on 2019-12-21 12:39:55
Question: When I use an analyzer with edge n-grams (min=3, max=7, front) plus term_vector=with_positions_offsets, with a document having text = "CouchDB", and I search for "couc", my highlight is on "cou" and not "couc". It seems my highlight covers only the minimum matching token "cou", while I would expect it to cover the exact token (if possible) or at least the longest token found. It works fine without analyzing the text with term_vector=with_positions_offsets. What's the impact of removing the term_vector=with …
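To see why the highlighter has several candidate tokens to choose from, here is a sketch of the front edge n-grams (min=3, max=7) the analyzer in the question would emit for a token: both "cou" and "couc" end up in the index at the same position, so a term-vector-based highlighter that picks the first matching gram lands on the shorter one.

```java
import java.util.*;

public class EdgeNgrams {
    // Front edge n-grams: all prefixes of the token whose length is
    // between min and max (capped at the token's own length).
    static List<String> edgeNgrams(String token, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int n = min; n <= Math.min(max, token.length()); n++)
            grams.add(token.substring(0, n));
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNgrams("couchdb", 3, 7));
        // prints [cou, couc, couch, couchd, couchdb]
    }
}
```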

Sentence aware search with Lucene SpanQueries

Submitted by 青春壹個敷衍的年華 on 2019-12-21 12:29:19
Question: Is it possible to use a Lucene SpanQuery to find all occurrences where the terms "red", "green" and "blue" all appear within a single sentence? My first (incomplete/incorrect) approach is to write an analyzer that places a special sentence-marker token at the beginning of a sentence, in the same position as the first word of the sentence, and to then query for something similar to the following: SpanQuery termsInSentence = new SpanNearQuery( SpanQuery[] { new SpanTermQuery( new Term (MY_SPECIAL …
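Setting the index machinery aside, the predicate being asked for can be stated in a few lines of plain Java: split the text into sentences and check whether any one sentence contains all the query terms. A SpanQuery-based solution is an indexed, position-aware version of this same check (the example text is hypothetical).

```java
import java.util.*;

public class SentenceSearch {
    // Library-free statement of the goal: does any single sentence
    // contain every one of the query terms?
    static boolean allTermsInOneSentence(String text, Set<String> terms) {
        for (String sentence : text.split("[.!?]")) {
            Set<String> tokens = new HashSet<>(
                Arrays.asList(sentence.toLowerCase().split("\\W+")));
            if (tokens.containsAll(terms)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "The flag is red, green and blue. The sky is blue.";
        System.out.println(
            allTermsInOneSentence(text, Set.of("red", "green", "blue")));
        // prints true
    }
}
```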

Solr doesn't overwrite - duplicated uniqueKey entries

Submitted by て烟熏妆下的殇ゞ on 2019-12-21 12:28:21
Question: I have a problem with Solr 5.3.1. My schema is rather simple: I have one uniqueKey, "id", as a string, which is indexed, stored, required, and non-multivalued. I first add documents with content_type:document_unfinished and then overwrite the same document, with the same id but content_type:document. The document then appears twice in the index. Again, the only uniqueKey is "id", as a string; the id originally comes from a MySQL primary-key int. It also looks like this happens not …
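For contrast, here is the last-write-wins behavior a uniqueKey is expected to give: re-adding a document with an existing id replaces the earlier version rather than duplicating it, exactly like a map keyed by id (the id and content types below mirror the question's hypothetical example).

```java
import java.util.*;

public class UniqueKeyOverwrite {
    // A map keyed by the unique id: adding the same id again
    // overwrites the previous entry instead of creating a duplicate,
    // which is the behavior expected from Solr's uniqueKey.
    static final Map<String, String> index = new LinkedHashMap<>();

    static void add(String id, String contentType) {
        index.put(id, contentType); // same id -> overwrite
    }

    public static void main(String[] args) {
        add("42", "document_unfinished");
        add("42", "document");
        System.out.println(index.size() + " " + index.get("42"));
        // prints 1 document
    }
}
```

When Solr keeps both versions instead, the common causes are documents landing in different shards/cores or a schema reload that left the uniqueKey definition out of sync with the index.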