UTF8 encoding is longer than the max length 32766

鱼传尺愫 2020-11-29 01:39

I've upgraded my Elasticsearch cluster from 1.1 to 1.2, and I now get errors when indexing a somewhat large string.

{
  "error": "IllegalArgumentException[Docu…
10 Answers
  • 2020-11-29 02:30

    One way of handling tokens that exceed the Lucene limit is to use the truncate token filter, similar to ignore_above for keyword fields. To demonstrate, I'm using a length of 5. Elasticsearch suggests ignore_above = 32766 / 4 = 8191, since a UTF-8 character may occupy at most 4 bytes: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/ignore-above.html (an index-mapping sketch follows the output below).

    curl -H'Content-Type:application/json' localhost:9200/_analyze -d'{
      "filter" : [{"type": "truncate", "length": 5}],
      "tokenizer": {
        "type":    "pattern"
      },
      "text": "This movie \n= AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
    }'
    

    Output:

    {
      "tokens": [
        {
          "token": "This",
          "start_offset": 0,
          "end_offset": 4,
          "type": "word",
          "position": 0
        },
        {
          "token": "movie",
          "start_offset": 5,
          "end_offset": 10,
          "type": "word",
          "position": 1
        },
        {
          "token": "AAAAA",
          "start_offset": 14,
          "end_offset": 52,
          "type": "word",
          "position": 2
        }
      ]
    }
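
    To apply this at index time rather than only in _analyze, the truncate filter can be declared in the index settings and wired into a custom analyzer, with ignore_above on a keyword sub-field. This is a minimal sketch, assuming an index named my_index and a field named body (both names are illustrative), using the recommended length of 8191:

    curl -XPUT -H'Content-Type:application/json' localhost:9200/my_index -d'{
      "settings": {
        "analysis": {
          "filter": {
            "max_term_length": { "type": "truncate", "length": 8191 }
          },
          "analyzer": {
            "truncating_analyzer": {
              "tokenizer": "standard",
              "filter": ["lowercase", "max_term_length"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "body": {
            "type": "text",
            "analyzer": "truncating_analyzer",
            "fields": {
              "raw": { "type": "keyword", "ignore_above": 8191 }
            }
          }
        }
      }
    }'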
    
  • 2020-11-29 02:30

    In Solr v6+ I changed the field type from string to text_general and it solved my problem. These were my original field definitions (a Schema API sketch of the change follows below):

    <field name="body" type="string" indexed="true" stored="true" multiValued="false"/>   
    <field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
    
  • 2020-11-29 02:34

    When using Logstash to index those long messages, I use this filter to truncate the long string:

        filter {
            # record the size of the original message in bytes
            ruby {
                code => "event.set('message_size', event.get('message').bytesize) if event.get('message')"
            }
            # truncate messages larger than 32000 bytes and tag them for later review
            ruby {
                code => "
                    if (event.get('message_size'))
                        event.set('message', event.get('message')[0..9999]) if event.get('message_size') > 32000
                        event.tag('long message') if event.get('message_size') > 32000
                    end
                "
            }
        }
    

    It adds a message_size field so that I can sort the longest messages by size.

    It also adds the long message tag to messages over 32000 bytes so I can select them easily.

    It doesn't solve the problem if you intend to index those long messages in full, but if, like me, you don't want them in Elasticsearch in the first place and just want to track them so you can fix them, it's a working solution.

  • 2020-11-29 02:35

    I got around this problem by changing my analyzer (a sketch of applying it at index creation follows the snippet below):

    {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "standard" : {
                        "tokenizer": "standard",
                        "filter": ["standard", "lowercase", "stop"]
                    }
                }
            }
        }
    }
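
    For reference, a minimal sketch of applying a similar analyzer when creating an index on a recent Elasticsearch version. The index name my_index, the analyzer name my_analyzer, and the field name body are assumptions; the standard token filter is omitted because newer releases removed it:

    curl -XPUT -H'Content-Type:application/json' localhost:9200/my_index -d'{
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "filter": ["lowercase", "stop"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "body": { "type": "text", "analyzer": "my_analyzer" }
        }
      }
    }'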
    