analyzer

Querying Lucene tokens without indexing

依然范特西╮ submitted on 2019-12-10 15:33:51
Question: I am using Lucene (or more specifically Compass) to log threads in a forum, and I need a way to extract the keywords behind the discussion. That said, I don't want to index every entry someone makes; rather, I'd have a list of 'keywords' that are relevant to a certain context, and if an entry matches a keyword and is above a threshold, I'd add the entry to the index. I want to be able to use the power of an analyzer to strip things out and do its magic, but then return the tokens from
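A minimal sketch of that last step, assuming plain Lucene 3.x-style APIs rather than anything Compass-specific (the field name "content" is arbitrary): run the analyzer over the text and collect its tokens without ever touching an IndexWriter.

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    // Returns the analyzed tokens of one forum entry without writing to any index.
    static List<String> analyzeOnly(String entryText) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
        List<String> tokens = new ArrayList<String>();
        TokenStream stream = analyzer.tokenStream("content", new StringReader(entryText));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            tokens.add(term.toString());   // lowercased, stop words removed, etc.
        }
        stream.end();
        stream.close();
        return tokens;   // match these against the keyword list, then decide whether to index the entry
    }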

How to analyse WebSphere core*.dmp and Snap*.trc files?

社会主义新天地 submitted on 2019-12-10 04:22:27
Question: All, I have my application running on WebSphere Application Server 7.0. I get core dumps and trace files such as core.20110909.164930.3828.0001.dmp and Snap.20110909.164930.3828.0003.trc. My question is: just as the thread dumps generated by WAS can be opened and analyzed with the IBM Thread Dump Analyzer tool, is there a tool (from IBM or anyone else) to open the files mentioned above? Thanks, Ayusman

Answer 1: The core dumps have to be processed by the jextract utility (of the JRE that produced the dump) from
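For reference, a hedged example of that jextract step (the JRE path below is purely illustrative; jextract ships with the IBM JRE that produced the dump and should be run with that same JRE):

    # Run jextract from the bin directory of the JRE that crashed (illustrative path):
    cd /opt/IBM/WebSphere/AppServer/java/jre/bin
    ./jextract /path/to/core.20110909.164930.3828.0001.dmp
    # This typically produces core.20110909.164930.3828.0001.dmp.zip, which
    # DTFJ-based IBM dump/memory analysis tools can then open.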

How to modify the standard analyzer to include #?

不羁的心 submitted on 2019-12-08 08:29:59
Question: Some characters, such as #, are treated as delimiters, so they never match in a query. What custom analyzer configuration, as close to the standard analyzer as possible, would allow these characters to be matched?

Answer 1: 1) The simplest way would be to use the whitespace tokenizer with a lowercase filter:

    curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase&pretty' -d 'new year #celebration vegas'

which would give you

    { "tokens" : [ { "token" : "new", "start_offset" : 0, "end_offset" : 3, "type"
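If that behaviour is what you want permanently, a hedged sketch of registering it as a custom analyzer at index creation (the index and analyzer names here are made up):

    curl -XPUT 'localhost:9200/my_index' -d '{
      "settings": {
        "analysis": {
          "analyzer": {
            "hashtag_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "filter": ["lowercase"]
            }
          }
        }
      }
    }'

The fields that should keep the # characters would then reference this analyzer in their mapping.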

Customizing Analyzers in Solr

可紊 submitted on 2019-12-07 10:02:45
Question: In Solr I have a custom Analyzer that takes two parameters. I know how to specify this Analyzer in the schema.xml, but I'm wondering how I can pass the two arguments, either in the schema.xml or at runtime in code.

Answer 1: You cannot pass parameters to the schema.xml at run-time, as far as I know. But you can use the reload command. This can be useful when (backwards-compatible) changes have been made to your solrconfig.xml or schema.xml files (e.g. new declarations, changed default params for a
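As an illustration (not the asker's actual analyzer), parameters are normally passed declaratively as attributes on the tokenizer/filter factories of a field type in schema.xml, for example:

    <fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10"/>
      </analyzer>
    </fieldType>

For a fully custom Analyzer class, the usual route is to wrap it in a factory so its two parameters can be supplied as attributes like the ones above; changing those values still requires a core reload.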

Why does Lucene QueryParser need an Analyzer?

北城余情 submitted on 2019-12-06 23:51:53
Question: I'm new to Lucene and trying to parse a raw string into a Query using the QueryParser. I was wondering: why does the QueryParser.Parse() method need an Analyzer parameter at all? If analyzing is something that has to do with querying, then an Analyzer should be specified when dealing with regular Query objects as well (TermQuery, BooleanQuery, etc.), and if not, why does QueryParser require one?

Answer 1: When indexing, Lucene divides the text into atomic units (tokens). During this phase many
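A short hedged example of why the parser takes one (Lucene 3.x-style APIs; the field name "content" is assumed): the analyzer turns the raw query text into the same kind of tokens that were produced at index time.

    // The query text is analyzed exactly as the indexed documents were.
    QueryParser parser = new QueryParser(Version.LUCENE_35, "content",
                                         new StandardAnalyzer(Version.LUCENE_35));
    Query query = parser.parse("New Year celebration");  // throws ParseException
    // Hand-built TermQuery/BooleanQuery objects skip this step, so the caller must
    // already supply terms that exactly match what the index contains.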

How to add analyzer settings in ElasticSearch?

南笙酒味 submitted on 2019-12-06 03:49:13
I am using ElasticSearch 1.5.2 and I wish to have the following settings:

    "settings": {
      "analysis": {
        "filter": {
          "filter_shingle": {
            "type": "shingle",
            "max_shingle_size": 2,
            "min_shingle_size": 2,
            "output_unigrams": false
          },
          "filter_stemmer": {
            "type": "porter_stem",
            "language": "English"
          }
        },
        "tokenizer": {
          "my_ngram_tokenizer": {
            "type": "nGram",
            "min_gram": 1,
            "max_gram": 1
          }
        },
        "analyzer": {
          "ShingleAnalyzer": {
            "tokenizer": "my_ngram_tokenizer",
            "filter": [
              "standard",
              "lowercase",
              "filter_stemmer",
              "filter_shingle"
            ]
          }
        }
      }
    }

Where should I add them? I mean, before index creation or
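For illustration only (this is not from the original thread): analysis settings like these are normally supplied when the index is created, e.g. wrapped in the request body of an index-creation call (the filter definitions from the question would go in the same analysis block; the index name here is made up):

    curl -XPUT 'localhost:9200/my_index' -d '{
      "settings": {
        "analysis": {
          "tokenizer": {
            "my_ngram_tokenizer": { "type": "nGram", "min_gram": 1, "max_gram": 1 }
          },
          "analyzer": {
            "ShingleAnalyzer": {
              "tokenizer": "my_ngram_tokenizer",
              "filter": ["standard", "lowercase"]
            }
          }
        }
      }
    }'

For an index that already exists, the usual route is to close it, update the analysis settings through the _settings endpoint, and reopen it; any field that should use the analyzer still needs it referenced in its mapping.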

Elasticsearch synonym analyzer not working

回眸只為那壹抹淺笑 submitted on 2019-12-05 19:56:06
Question: EDIT: To add on to this, the synonyms seem to be working with basic query_string queries.

    "query_string" : {
      "default_field" : "location.region.name.raw",
      "query" : "nh"
    }

This returns all of the results for New Hampshire, but a "match" query for "nh" returns no results. I'm trying to add synonyms to my location fields in my Elastic index, so that if I do a location search for "Mass," "Ma," or "Massachusetts" I'll get the same results each time. I added the synonym filter to my settings and
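For context, a hedged sketch of the kind of synonym setup involved (the names and synonym list here are illustrative, not the asker's actual settings); a match query only expands synonyms if the field's analyzer, at least at search time, actually includes the synonym filter:

    "analysis": {
      "filter": {
        "state_synonyms": {
          "type": "synonym",
          "synonyms": [
            "nh, new hampshire",
            "ma, mass, massachusetts"
          ]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "state_synonyms"]
        }
      }
    }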

Obj-C: Instance variable used while 'self' is not set to the result of '[(super or self) init…]'

﹥>﹥吖頭↗ submitted on 2019-12-05 12:21:23
I asked a similar question to this already, but I still can't see the problem:

    -(id)initWithKeyPadType:(int)value {
        [self setKeyPadType:value];
        self = [self init];
        if (self != nil) {
            //self.intKeyPadType = value;
        }
        return self;
    }

    - (id)init {
        NSNumberFormatter *formatter = [[[NSNumberFormatter alloc] init] autorelease];
        decimalSymbol = [formatter decimalSeparator];
        ....

The warning comes from the line above: "Instance variable used while 'self' is not set to the result of '[(super or self) init...]'"

What you are trying to do is technically OK, but at some stage you need to invoke [super init]
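A minimal sketch of the conventional pattern (assuming intKeyPadType is an ivar of this class and manual reference counting, as in the question):

    - (id)initWithKeyPadType:(int)value
    {
        self = [super init];            // assign self before touching any instance variable
        if (self != nil) {
            intKeyPadType = value;
        }
        return self;
    }

The plain -init would likewise assign and check self = [super init] before touching decimalSymbol.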

Elasticsearch count terms ignoring spaces

牧云@^-^@ submitted on 2019-12-04 09:42:39
Question: Using ES 1.2.1. My aggregation:

    {
      "size": 0,
      "aggs": {
        "cities": {
          "terms": { "field": "city", "size": 300000 }
        }
      }
    }

The issue is that some city names have spaces in them and aggregate separately. For instance, Los Angeles:

    { "key": "Los", "doc_count": 2230 },
    { "key": "Angeles", "doc_count": 2230 },

I assume it has to do with the analyzer? Which one would I use to not split on spaces?

Answer 1: For fields that you want to perform aggregations on, I would recommend either the keyword analyzer or do not
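A hedged mapping sketch (the type name is made up; "city" comes from the question) that keeps each city as a single term for the aggregation:

    "mappings": {
      "my_type": {
        "properties": {
          "city": { "type": "string", "index": "not_analyzed" }
        }
      }
    }

A multi-field with a not_analyzed sub-field (e.g. city.raw) is another option if full-text search on city should keep working; the terms aggregation would then target city.raw instead.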

What is the best Lucene setup for ranking exact matches the highest?

≡放荡痞女 submitted on 2019-12-04 06:35:10
Which analyzers should be used for indexing and for searching when I want an exact match to rank higher than a "partial" match? Possibly by setting up custom scoring in a Similarity class? For example, when my index consists of "car parts", "car", and "car shop" (indexed with StandardAnalyzer on Lucene 3.5), a query for "car" results in:

    car parts
    car
    car shop

(basically returned in the order in which they were added, since they all get the same score). What I would like to see is "car" ranked first, then the other results (it doesn't really matter in which order; I assume the analyzer can influence that). All
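One common approach, shown here as an illustrative Lucene 3.x sketch rather than a definitive answer (the field names are made up): index the text twice, once analyzed and once as a single untokenized term, then combine both at query time with a boost on the exact field.

    // Index time: "name" is analyzed, "name_exact" holds the value as one untokenized term.
    Document doc = new Document();
    doc.add(new Field("name", "car parts", Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("name_exact", "car parts", Field.Store.NO, Field.Index.NOT_ANALYZED));

    // Query time: exact hits get a large boost, analyzed hits still match.
    BooleanQuery query = new BooleanQuery();
    TermQuery exact = new TermQuery(new Term("name_exact", "car"));
    exact.setBoost(5.0f);                               // only the document whose whole value is "car" gets this
    query.add(exact, BooleanClause.Occur.SHOULD);
    query.add(new TermQuery(new Term("name", "car")), BooleanClause.Occur.SHOULD);

With this layout, "car" outscores "car parts" and "car shop" because it matches both clauses, while the others match only the analyzed field.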