I've upgraded my Elasticsearch cluster from 1.1 to 1.2 and I get errors when indexing a somewhat big string.
{
  "error": "IllegalArgumentException[Docu
One way of handling tokens that are over the Lucene limit is to use the truncate filter, similar to ignore_above for keywords. To demonstrate, I'm using a length of 5.
Elasticsearch suggests using ignore_above = 32766 / 4 = 8191, since UTF-8 characters may occupy at most 4 bytes:
https://www.elastic.co/guide/en/elasticsearch/reference/6.3/ignore-above.html
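For example, a minimal sketch of a 6.x-style mapping that applies that limit (my_index, the _doc type and the body field are placeholder names, not from the question):
curl -XPUT -H'Content-Type:application/json' localhost:9200/my_index -d'{
  "mappings": {
    "_doc": {
      "properties": {
        "body": { "type": "keyword", "ignore_above": 8191 }
      }
    }
  }
}'
Strings longer than that limit are then skipped at index time instead of causing the immense-term error, while still being available in _source.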
curl -H'Content-Type:application/json' localhost:9200/_analyze -d'{
  "filter": [{ "type": "truncate", "length": 5 }],
  "tokenizer": { "type": "pattern" },
  "text": "This movie \n= AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}'
Output:
{
  "tokens": [
    {
      "token": "This",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "movie",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "AAAAA",
      "start_offset": 14,
      "end_offset": 52,
      "type": "word",
      "position": 2
    }
  ]
}
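Beyond the _analyze demo, the same filter can be registered in a custom analyzer in the index settings. A sketch keeping the demo length of 5 (my_index and the filter/analyzer names are placeholders; in practice you would pick a length below the Lucene byte limit):
curl -XPUT -H'Content-Type:application/json' localhost:9200/my_index -d'{
  "settings": {
    "analysis": {
      "filter": {
        "truncate_5": { "type": "truncate", "length": 5 }
      },
      "analyzer": {
        "truncated_pattern": {
          "type": "custom",
          "tokenizer": "pattern",
          "filter": ["truncate_5"]
        }
      }
    }
  }
}'
A text field can then reference it in its mapping with "analyzer": "truncated_pattern".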
In Solr v6+ I changed the field type to text_general and it solved my problem. The original field definitions looked like this:
<field name="body" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
Since I use Logstash to index those long messages, I use this filter to truncate the long string:
filter {
  ruby {
    # record the message size in bytes so it can be sorted on later
    code => "event.set('message_size', event.get('message').bytesize) if event.get('message')"
  }
  ruby {
    # truncate and tag messages bigger than 32000 bytes
    code => "
      if (event.get('message_size'))
        event.set('message', event.get('message')[0..9999]) if event.get('message_size') > 32000
        event.tag('long message') if event.get('message_size') > 32000
      end
    "
  }
}
It adds a message_size field so that I can sort the longest messages by size.
It also adds the long message tag to those that are over 32000 bytes so I can select them easily.
It doesn't solve the problem if you intend to index those long messages completely, but if, like me, you don't want to have them in Elasticsearch in the first place and want to track them so you can fix them, it's a working solution.
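For reference, a minimal sketch of a search that lists the biggest messages first (assuming the usual logstash-* index pattern and that message_size was mapped as a number):
curl -H'Content-Type:application/json' 'localhost:9200/logstash-*/_search' -d'{
  "size": 10,
  "sort": [ { "message_size": { "order": "desc" } } ]
}'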
I got around this problem by changing my analyzer:
{
  "index": {
    "analysis": {
      "analyzer": {
        "standard": {
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "stop"]
        }
      }
    }
  }
}
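A sketch of applying these settings to an existing index (my_index and analysis.json are placeholder names; analyzer changes require the index to be closed first, and already-indexed documents are not re-analyzed, so a reindex is needed for them):
# save the JSON above as analysis.json, then:
curl -XPOST localhost:9200/my_index/_close
curl -XPUT -H'Content-Type:application/json' localhost:9200/my_index/_settings -d @analysis.json
curl -XPOST localhost:9200/my_index/_open
Note that the standard token filter in this chain only exists on older Elasticsearch versions (it was removed in 7.0), which fits the 1.x cluster in the question.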